


Title:
ARTIFICIAL NEURAL NETWORKS
Document Type and Number:
WIPO Patent Application WO/2021/118651
Kind Code:
A1
Abstract:
An artificial neural network (ANN) generates an expanded matrix that represents an output of a layer of the ANN, such as the output layer. Values in each row are grouped with respect to network parameters in at least one previous layer. The ANN updates values in at least one column of the expanded matrix according to network parameter updates, and the values in each row are summed to produce an updated vector. An error or a total cost can be computed from the updated vector and used to determine subsequent parameter updates. Nonlinear activation functions can be modeled as piecewise linear functions, and a change in an activation function's slope can be modeled as a linear update to the expanded matrix. Parameter values and their updates can be quantized or otherwise constrained to simplify update operations.

Inventors:
SHATTIL STEVE (US)
Application Number:
PCT/US2020/045773
Publication Date:
June 17, 2021
Filing Date:
August 11, 2020
Assignee:
GENGHISCOMM HOLDINGS LLC (US)
International Classes:
G06N3/08; G06N3/04; G06N3/063
Domestic Patent References:
WO2019006088A1 (2019-01-03)
WO2017214507A1 (2017-12-14)
Foreign References:
US20190236445A1 (2019-08-01)
CN110050256A (2019-07-23)
KR20180096474A (2018-08-29)
Attorney, Agent or Firm:
SHATTIL, Steven (US)
Claims:
Claims

1. An apparatus for an artificial neural network (ANN), comprising: a memory; and one or more processors coupled to the memory, the memory including instructions executable by the one or more processors to: generate a base expanded matrix (100) having a plurality of values arranged in a plurality of rows and a plurality of columns, the plurality of values comprising a plurality of the ANN's parameter values, the base expanded matrix representing an output of a layer of the ANN, wherein a sum of values in each row produces a base output vector of activations; and train the ANN by updating values (150) in at least one column of the base expanded matrix to produce an updated expanded matrix or an update expanded matrix.

2. The apparatus of claim 1, further characterized by instructions executable by the one or more processors to: sum (304) values in each row of the updated expanded matrix to produce an updated output vector, or sum (304) values in each row of the update expanded matrix to produce an update output vector; and compute (305) at least one of an error, a change in the error, a total cost, or a change in the total cost from the updated expanded matrix or the update expanded matrix.

3. The apparatus according to claim 1 or 2, further characterized in that the base expanded matrix comprises a linear approximation of the ANN; and wherein nonlinear changes in network parameter values are approximated (301) by linear updates, or wherein a change in a network parameter value from a first linear approximation to a second linear approximation is approximated (313) as a linear update.

4. The apparatus according to any of the claims 1 to 3, wherein the at least one of the plurality of columns is updated based on an update to a parameter in a previous layer of the ANN; or wherein at each of a plurality of nodes in the ANN, input arguments are compared to breakpoints (302), and upon crossing a breakpoint, a linear update (313) is made to column values in at least one of the base expanded matrix, the updated expanded matrix, or the update expanded matrix.

5. The apparatus according to any of the claims 1 to 4, wherein the updated expanded matrix is computed from at least one of an additive update or a multiplicative update to the base expanded matrix.

6. The apparatus according to any of the claims 1 to 5, wherein each column of the base expanded matrix comprises coefficients corresponding to at least one of a set of ANN parameters, or each column of the base expanded matrix comprises products of the coefficients with the at least one of the set of ANN parameters.

7. The apparatus according to any of the claims 1 to 6, wherein each value in the base expanded matrix is computed numerically (321), and the update expanded matrix or the updated expanded matrix is computed by constraining (332) one or more update values to a restricted set of values such that updates (333) to each value in the base expanded matrix can comprise only shifting the base expanded matrix value's bits, changing the base expanded matrix value's sign bit, deleting the base expanded matrix value, and changing the base expanded matrix value's decimal point.

8. The apparatus according to any of the claims 1 to 7, wherein the ANN comprises at least one of a deep-learning neural network, a multi-layer perceptron (MLP), a recursive neural network (RNN), a recurrent neural network, a bi-directional RNN, a Long Short-Term Memory (LSTM) network, a Gated Recurrent Unit (GRU) network, or a Convolutional Neural Network (CNN).

9. A method for training an artificial neural network (ANN), configured to: generate a base expanded matrix (100) having a plurality of rows and a plurality of columns, the base expanded matrix representing an output of a layer of the ANN, wherein a sum of values in each row produces a base output vector of activations; and train the ANN by updating values (150) in at least one column of the base expanded matrix to produce an updated expanded matrix or an update expanded matrix.

10. The method of claim 9, further characterized by being configured to: sum (304) values in each row of the updated expanded matrix to produce an updated output vector; or sum (304) values in each row of the update expanded matrix to produce an update output vector; and compute (305) at least one of an error, a change in the error, a total cost or a change in the total cost from the updated expanded matrix or the update expanded matrix.

11. The method according to claim 9 or 10, further characterized in that the base expanded matrix comprises a linear approximation of the ANN; and wherein nonlinear changes in network parameter values are approximated (301) by linear updates, or wherein a change in a network parameter value from a first linear approximation to a second linear approximation is approximated (313) as a linear update.

12. The method according to any of the claims 9 to 11, wherein the updated expanded matrix is computed from at least one of an additive update or a multiplicative update to the base expanded matrix; and wherein each column of the base expanded matrix comprises coefficients corresponding to at least one of a set of ANN parameters, or each column of the base expanded matrix comprises products of the coefficients with the at least one of the set of ANN parameters.

13. The method according to any of the claims 9 to 12, wherein each value in the base expanded matrix is computed numerically (321), and the update expanded matrix or the updated expanded matrix is computed by constraining (332) one or more update values to a restricted set of values such that updates (333) to each value in the base expanded matrix can comprise only shifting the base expanded matrix value's bits, changing the base expanded matrix value's sign bit, deleting the base expanded matrix value, and changing the base expanded matrix value's decimal point.

14. The method according to any of the claims 9 to 13, wherein the ANN comprises at least one of a deep-learning neural network, a multi-layer perceptron (MLP), a recursive neural network (RNN), a recurrent neural network, a bi-directional RNN, a Long Short-Term Memory (LSTM) network, a Gated Recurrent Unit (GRU) network, or a Convolutional Neural Network (CNN).

15. A computer program product, characterized by: a computer-readable medium comprising: code for causing a computer to implement the method of any of claims 9 to 14.

Description:
Artificial Neural Networks

CROSS REFERENCE TO PRIOR APPLICATIONS

[0001] This application is a Continuation of United States Patent Application Serial No. 16/712,954, filed on December 12, 2019, which is hereby incorporated by reference in its entirety.

BACKGROUND

[0002] Aspects of this disclosure relate generally to artificial neural networks (ANNs), and more particularly, to computationally efficient processing in ANNs.

[0003] The background description includes information that may be useful in understanding the present inventive subject matter. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed inventive subject matter, or that any publication, specifically or implicitly referenced, is prior art.

[0004] Neural networks comprise multiple hidden layers, and each of the hidden layers has multiple hidden nodes, each consisting of an affine map of the outputs from the previous layer followed by a nonlinear map called an activation function. The nonlinear activation function makes neural networks differ from linear models; that is, a neural network becomes a linear function if a linear activation function is used. The problem of training a feedforward neural network is to determine a number of adjustable parameters, or connection weights, based on a set of training data. A trained feedforward neural network can be regarded as a nonlinear mapping from the input space to the output space. However, neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets.

SUMMARY

[0005] The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that follows. Any of the functions, techniques, aspects, operations, or concepts described in this Summary may be employed in any of the disclosure herein, including the Drawings and Detailed Description.

[0006] Disclosed techniques can select weights and/or biases to update in a (e.g., supervised and/or unsupervised) learning process that trains an ANN, and update those weights and/or biases. Some aspects adapt the network topology, such as by adding and/or deleting neurons, synapses, and/or layers; adapting the functions performed in neurons, or any combination thereof. Techniques and/or aspects disclosed herein can employ training data to tune, or train, the network (such as the network's parameters, hyperparameters, and/or topology) to program the network to perform any application-specific computations disclosed herein. Aspects can update the network's output or cost function after an update to one or more weights, biases, activation functions, network topology, and/or other parameters or hyperparameters. Aspects may comprise or effect a forward pass (e.g., forward propagation) or backpropagation.

[0007] In one aspect, a set of inputs is provided to an ANN, and a forward pass is performed wherein outputs from one layer are fed forward to one or more other layers. An ANN can be a (nodal) network of interconnected neurons, where each neuron represents a node in the network. Groups of nodes may be arranged in layers, with the outputs of one layer feeding forward to a next layer in a multilayer perceptron (MLP) arrangement. An MLP may be understood to be a feedforward neural network model that maps a set of input data onto a set of output data. In some aspects, each node produces an output that is fed forward to nodes in the layer immediately following it. Each node in a hidden layer may receive multiple inputs, either from the input layer or from outputs of nodes in a preceding hidden layer. In general, each node may apply an activation or transfer function to its inputs to produce an output for that node. Nodes in hidden layers (e.g., learning layers) may apply the same function to their respective input(s) to produce their respective output(s). Nonlinear activation functions, pooling, aggregation, etc. can cause some information to be lost. The node or the function may preserve (or store) some information from its input.

[0008] Before the neural network is trained, the weights may be assigned initial (e.g., random) values. The weights may be trained (e.g., tuned) so that for a given training vector input, the neural network produces an output close to a desired training vector output (label). For example, the expanded matrix at the neural network's output (and/or one or more layers) may be analyzed to determine how much effect each weight has on the overall error. Selected weights in the expanded matrix may be incrementally adjusted (updated) to move the neural network's output closer to the desired training output. In the forward pass of each cycle, a training input may be fed forward through the neural network to determine its actual vector output. Linearity of the model can be exploited to speed up computations. An error for each output node is then calculated based on the actual node output and its target training output (label). This cycle can be repeated until the actual output of the neural network is within an acceptable error range of the desired training output.

[0009] In one aspect, the forward pass can be configured to effect matrix multiplications in place of elementwise vector products and/or matrix-vector products (e.g., at each hidden layer, effectively converting input vectors to matrices, converting weight vectors to matrices, and/or converting activation functions to linear matrix operations; performing matrix-matrix multiplication of these matrices, and outputting at each layer an expanded matrix of the corresponding layer output), so the output layer provides an expanded-matrix output, wherein the values in the expanded matrix's rows can be summed to produce the vector output. An update to the expanded-matrix output can be performed, such as by updating any columns in the expanded-matrix output that correspond to parameter updates. The elements in each row of the resulting updated expanded-matrix output can be summed to produce an updated vector output, and an updated error or cost may be computed therefrom, from which subsequent parameter updates may be determined.
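By way of example, but without limitation, the following Python (NumPy) sketch illustrates the expanded-matrix output, single-column update, and row-summing described above for a small linearized output layer; the dimensions, the quadratic cost, and all variable names are illustrative assumptions rather than requirements of the disclosure.

import numpy as np

rng = np.random.default_rng(0)
N_in, N_out = 4, 3
a_in = rng.normal(size=N_in)            # activations fed forward from a previous layer
W = rng.normal(size=(N_out, N_in))      # output-layer weights
g = 0.5                                 # slope of a linearized activation

# Expanded-matrix output: element (j, i) holds g * W[j, i] * a_in[i], so summing
# each row reproduces the ordinary vector output g * W @ a_in.
A_L = g * W * a_in[np.newaxis, :]
a_out = A_L.sum(axis=1)
assert np.allclose(a_out, g * (W @ a_in))

# A parameter update to input i only touches column i of the expanded matrix.
i, a_new = 2, 1.25
A_L_u = A_L.copy()
A_L_u[:, i] = g * W[:, i] * a_new

# Row sums of the updated expanded matrix give the updated vector output,
# from which an updated error or cost can be computed.
a_out_u = A_L_u.sum(axis=1)
label = np.array([1.0, 0.0, 0.0])       # illustrative training label
cost = 0.5 * np.sum((a_out_u - label) ** 2)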

[0010] Disclosed aspects can be implemented by methods, systems of processing elements operating as an ANN, individual processors in an ANN that are programmed to perform disclosed methods, computer software residing on non-transitory computer-readable memory, and electronic circuitry, for example.

[0011] In one aspect, an ANN generates a base expanded matrix having a plurality of rows and a plurality of columns, the base expanded matrix representing an output of a layer of the ANN, wherein a sum of values in each row produces a base output vector. The ANN can update values in at least one column of the base expanded matrix to produce an updated or update expanded matrix, and compute an error and/or total cost from the updated or update expanded matrix (which can include summing elements in each matrix row). The error and/or total cost may be used to update parameters or hyperparameters in the ANN, such as to reduce the error of the ANN's output, or prediction, such as may be measured from a labeled training data set.

[0012] The base expanded matrix can comprise parameters from one or more previous layers. The values in the base expanded matrix can be numerically computed from these parameters. At least one column of the base expanded matrix can be selected for update based on which parameter(s) in a previous layer of the ANN is (are) updated. The updated expanded matrix can be computed from at least one of an additive update or a multiplicative update to the base expanded matrix.

[0013] In one aspect, each column of the base expanded matrix comprises coefficients corresponding to one of a set of ANN parameters. In another aspect, each column of the base expanded matrix comprises products of the coefficients with the one of the set of ANN parameters.

[0014] Aspects can provide for numerically computing each matrix value in the base expanded matrix, and the update expanded matrix or the updated expanded matrix can be computed wherein one or more of the matrix values and/or update values are constrained (quantized) to a restricted (quantized) set of values such that numerical updates to the base expanded matrix values can be simplified (e.g., with respect to processing operations and/or memory access), such as wherein updates to the matrix values comprise a small (e.g., reduced) number of bit shifts, changing the matrix value's sign bit, deleting (or ignoring) the matrix value, or changing the matrix value's decimal point. By way of example, shifting the bit pattern of a number to the right by 1 bit divides the number by 2. Similarly, shifting a number to the left by 1 bit multiplies the number by 2. A decimal left shift of an unsigned binary integer I by S digit positions corresponds to multiplying I by 10^S. In some aspects, the ANN comprises multiple electronic synapses connecting multiple electronic neurons. Some aspects can sum values in each row of the updated expanded matrix to produce an updated output vector.
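By way of example, but without limitation, the following Python sketch shows how an update constrained to a power of two (or to a sign change or deletion) can be applied to a fixed-point matrix entry using shift operations rather than multiplications; the fixed-point format and function name are illustrative assumptions.

def apply_constrained_update(value, shift, sign=1, keep=True):
    # Update a fixed-point matrix entry using only shift, sign-change, or
    # delete (zeroing) operations; shift > 0 multiplies by 2**shift,
    # shift < 0 divides by 2**(-shift).
    if not keep:
        return 0
    shifted = value << shift if shift >= 0 else value >> (-shift)
    return sign * shifted

value = 104                                    # e.g., 104 * 2**-8 in a fixed-point format
print(apply_constrained_update(value, 1))      # 208: value * 2 via a left shift
print(apply_constrained_update(value, -1))     # 52: value / 2 via a right shift
print(apply_constrained_update(value, 0, -1))  # -104: sign-bit change only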

[0015] In some aspects, ANN parameter data structures employed in the training model are configured to commute under multiplication, such as to simplify (computationally) or speed up updates to previous error or cost computations as ANN parameters are updated. In some aspects, a base expanded matrix is updated with respect to a parameter update in a previous layer, followed by updating an activation function in a layer following the previous layer. The base expanded matrix may be an initially computed base expanded matrix, or it may be a previously updated expanded matrix.

[0016] In some aspects, updates to some expanded matrix values are skipped when the values or updates are below a threshold value. This can effect sparse matrix operations, which may reduce computations without significantly reducing the ANN's accuracy. Updating can be performed in response to at least one of ANN parameter updates, network pruning, dropout, or quantization. An update to the expanded matrix can be used for computing gradients, and may be used to determine which parameters to update or not update.

[0017] In some aspects, data inputs to the ANN can comprise measurements of physical signals used for data communications. The input data may include signals received in a wireless communication network, and the ANN can be trained (e.g., its parameters tuned) for analyzing the signals (e.g., estimating transmitted data symbols, removing interference, separating signals, etc.). The input data may comprise signals to be transmitted in the wireless communication network, and the ANN can be trained (e.g., tuned) for synthesizing the signals (e.g., precoding, spreading, spatial multiplexing, PAPR-reduction, etc.).

[0018] In disclosed aspects, the ANN may employ a machine learning algorithm that comprises an association-rule machine learning algorithm, a clustering algorithm, a k-means algorithm, a collaborative filtering algorithm, an artificial intelligence algorithm, an artificial neural network algorithm, or a recurrent neural network algorithm.

[0019] To the accomplishment of the foregoing and related ends, the one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the annexed drawings set forth in detail certain illustrative features of the one or more aspects. These features are indicative, however, of but a few of the various ways in which the principles of various aspects may be employed, and this description is intended to include all such aspects and their equivalents.

BRIEF DESCRIPTION OF THE DRAWINGS

[0020] The disclosed aspects will hereinafter be described in conjunction with the appended drawings, provided to illustrate and not to limit the disclosed aspects, wherein like designations denote like elements, and in which:

[0021] FIG. 1A is a diagram of an ANN that can be configured in accordance with aspects of the disclosure.

[0022] FIG. 1B depicts method, apparatus, and system aspects according to the disclosure.

[0023] FIG. 1C is a flow diagram that illustrates some method and apparatus aspects configured in accordance with aspects of the disclosure.

[0024] FIG. 2A is a block diagram that depicts a node, or neuron, in an ANN, and operations that can occur in the node according to disclosed aspects.

[0025] FIG. 2B is a block diagram that depicts a method in which an additive update to a linear ANN model can be performed according to some aspects of the disclosure.

[0026] FIG. 2C is a block diagram that depicts an additive update to a linear ANN model that can be configured for an output computation, such as a gradient computation and/or a computation of a change in error, loss, or cost.

[0027] FIGs. 3A, 3B, 3C, and 3D are flow diagrams that depict some aspects of the disclosure.

[0028] FIGs. 4A, 4B, 4C, and 4D depict method and apparatus configurations according to some aspects of the disclosure.

[0029] FIG. 5 is a flow diagram shown with respect to layers of an ANN that depicts aspects in which expanded matrices can be used in combination with linear updates in an ANN.

[0030] FIG. 6 illustrates data structures and a method whereby network quantization and weight sharing can compress an ANN by reducing the number of bits required to represent each weight.

[0031] FIG. 7 is a flow diagram that illustrates a process according to some aspects of the disclosure. This process can reduce memory requirements, which might enable the model to be fit into on-chip SRAM cache rather than off-chip DRAM memory.

[0032] FIG. 8A illustrates a graphics processing unit (GPU) architecture that can be optimized for signal-processing functions disclosed herein.

[0033] FIG. 8B is a flow diagram that is illustrative of a method, functional components of an apparatus, and code segments of a computer program in accordance with aspects of the disclosure.

DETAILED DESCRIPTION

[0034] The description that follows includes exemplary systems, methods, techniques, instruction sequences, and computer program products that embody techniques of this disclosure. However, it is understood that the described aspects may be practiced without these specific details. Apparatuses and methods are described in the following description and illustrated in the accompanying drawings by various blocks, modules, components, circuits, steps, processes, algorithms, etc. (collectively referred to as "elements"). These elements may be implemented using electronic hardware, computer software, firmware, or any combination thereof.

[0035] FIG. 1A is a diagram of an artificial neural network (ANN), which can be generalized to any of various types and configurations of ANNs, including (but not limited to) deep-learning neural networks, multi-layer perceptrons (MLPs), recursive neural networks (RNNs), recurrent neural networks, bi-directional RNNs, Long Short-Term Memory (LSTM) networks, Gated Recurrent Unit (GRU) networks, Convolutional Neural Networks (CNNs), and others. A simplified ANN topology is illustrated for the purpose of explaining some disclosed aspects. The ANN can comprise a plurality of layers, such as at least Layer (L-3) to Layer L, each layer having one or more neurons (e.g., nodes), such as N_1 nodes 103.1, 103.2, ..., 103.N_1 in Layer (L-3), N_2 nodes 102.1, 102.2, ..., 102.N_2 in Layer (L-2), N_3 nodes 101.1, 101.2, ..., 101.N_3 in Layer (L-1), and N_4 nodes 100.1, 100.2, ..., 100.N_4 in Layer L. In some aspects, Layer (L-3) might be the input layer, and Layer L might be the output layer. Each node receives one or more input values, such as data inputs "x" to the ANN and/or output values (e.g., activations) "a" from other nodes in one or more previous layers. The node combines its inputs, possibly with weights and/or possibly with a bias value, to produce a linear combination "z" value. The node may then perform a non-linear activation function on z to produce its output (e.g., activation).

[0036] In disclosed aspects, the nodes can employ linear approximations of the non-linear activation functions; or updates to the ANN model, as well as other computations, can be computed using linear approximations. Linear approximations may be updated to account for nonlinearity, and such updates may be linear (e.g., first-order) approximations. In some aspects, small changes to z are exploited for providing linear updates to the ANN model, or at least to portions of the ANN model. For example, an update to a node's output a can be computed from a linear approximation a = g·z for small changes in z, which approximates the non-linear activation a = f(z), where g is a scalar value that approximates the slope of the activation function f in the vicinity of z. Furthermore, the value of g can be updated to account for the nonlinearity of f, and such updates to g can be implemented via linear approximations. A matrix expansion of the model's output (such as the activations of the final layer, or values computed therefrom, such as error, loss, or cost) can be used to facilitate linear approximations and linear-model updates. In some aspects, updates to g can be implemented using a recursive function (e.g., a recursion operation). In some aspects, the non-linear operation f can be performed using recursion to update the node's output a. For example, a change in z can be used to compute a number N_z of iterations, or steps, in a recursion operation that updates g or performs f(z). One advantage of recursion operations is that they can improve time efficiency. Another advantage is that recursive functions are easily configured as iterative functions, which may be memory efficient. For example, a recursive function might call itself repeatedly until a certain condition is met, which may alternatively be reformulated as an iterative function, such as wherein a function repeats a particular task until a condition fails.
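By way of example, but without limitation, the following Python (NumPy) sketch models a nonlinear activation by a piecewise linear slope g selected from breakpoints, so that a ≈ g·z near z; the breakpoints and slope values are illustrative assumptions, not values prescribed by the disclosure.

import numpy as np

# Illustrative breakpoints and per-segment slopes for a tanh-like activation.
breakpoints = np.array([-2.0, -0.5, 0.5, 2.0])
slopes = np.array([0.0625, 0.5, 1.0, 0.5, 0.0625])   # one slope per interval

def pwl_slope(z):
    # Return the linearized slope g for argument z, so that a ~ g * z near z.
    return slopes[np.searchsorted(breakpoints, z)]

z = 0.3
g = pwl_slope(z)
a_approx = g * z          # linear approximation of the activation output
print(g, a_approx, np.tanh(z))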

[0037] In practice, disclosed aspects can be adapted to various and adaptable ANN topologies, including (but not limited to) different synaptic pathway configurations (e.g., connectivity between nodes, including any of various feed-forward and/or feedback network topologies, which can include network inputs and/or outputs), different layer configurations, different node configurations, etc.

[0038] A layer can comprise one or more neurons (i.e., nodes), each with one or more connections (i.e., synapses) from neurons in a previous layer. The synapses can be weighted to adjust inputs to a node, and the weights (and other ANN parameters) can be adapted as part of a learning process. At Layer (L-1), for example, node 101.1 may provide weights to its inputs, which comprise node outputs from a previous layer, e.g., Layer (L-2), such as outputs from one or more nodes 102.1-102.N_2, to produce a combined input z. At Layer (L-1), for example, node 101.1 can comprise an activation function that operates on the combined input z and produces node output a. In some aspects, the activation function can cause the neuron to be inactive or active, and can scale its output. The activation function can be adapted as part of a learning process, such as with respect to a bias b.

[0039] FIG. 2A is a block diagram that depicts a node of layer l of the ANN, and method operations that can occur in the node. For example, at layer l, N outputs a_i^(l-1) from at least one previous layer can be weighted (e.g., multiplied by synaptic weights) 201.1-201.N. The weighted inputs are summed 203, possibly with a bias term b_j^l, to produce a combined input z_j^l to the node. The notation z_j^l includes the combined weighted values (i = 1, ..., N) and may or may not include the bias b_j^l. The input z_j^l may be operated upon by the activation function f to produce the node's output a_j^l. Alternatively, the activation function f and combined input z_j^l may be input to an activation linearizer 205, which can be configured to provide a linear approximation of the activation function f(z_j^l). In one aspect, the activation linearizer 205 computes a scaling multiplier g_j^l from f and z_j^l. The scaling term g_j^l can be computed based on the slope of f at, or in the vicinity of, z_j^l. A multiplier 207 multiplies g_j^l and z_j^l to produce the product g_j^l·z_j^l, which can be a linear approximation of f(z_j^l). Disclosed aspects include operations wherein the node updates one or more of its corresponding ANN parameters, including its synaptic weights, bias, input, activation function, and/or scaling term.

[0040] In one example, layer l might be the input layer, and the node input might be the ANN input value x_0. In one example, layer l might be the output layer, and the node output might be the ANN output value y_0. Alternatively, layer l might be a hidden layer, and the inputs might be outputs from a previous layer and the output might be input to a following layer.

[0041] FIG. 1B depicts method, apparatus, and system aspects of the disclosure. At least one matrix expansion of one or more ANN parameter sets is generated 100. The resulting expansion can be updated 150 by updating values (e.g., in selected one(s) of the columns of the expanded matrix A^L). Updates 150 can be additive updates, multiplicative updates, or combinations thereof. The resulting updated expanded matrix A^L(u) can be processed 170 in accordance with ANN processing disclosed herein. ANN processing 170 can include configuring 160 the updates in 150 according to one or more criteria in the processing 170. In one example, ANN parameters are updated via 150 to reduce or minimize a total cost, and might employ gradient descent, momentum, RMSprop, the adaptive moment estimation (Adam) optimization algorithm, or related techniques. Processing 170 can include selecting or adapting ANN hyperparameters, referred to herein as hyperparameter tuning.

[0042] Matrix expansion 100 comprises increasing the dimensionality of the ANN parameter's data structure. For example, an N_L×1 vector a^L output at layer L can be expanded to produce an N_L×N_L matrix A^L, from which a^L can be produced by summing the elements in each row of A^L. Matrix expansion 100 can apply to other data structure types, such as to expand a matrix to produce a tensor, or to increase the dimension of a tensor.

[0043] By way of example, but without limitation, an input vector a^(l-1) to an ANN or a portion of an ANN (denoted as layer l) can be formatted (or otherwise implemented functionally) 122 as a diagonal matrix A^(l-1), whose diagonal elements are the elements of vector a^(l-1):

A^(l-1) = diag(a_1^(l-1), a_2^(l-1), ..., a_N^(l-1))

This is referred to as a diagonal expansion matrix, or a matrix expansion, of the N×1 vector a^(l-1). The matrix A^(l-1) might be processed by each node in layer l, explicitly or functionally.

[0044] An N×1 weight vector w_j^l at each node (indexed by j) in layer l can be implemented functionally as an N×N diagonal matrix W_j^l with diagonal elements set to the elements in w_j^l:

W_j^l = diag(w_(j,1)^l, w_(j,2)^l, ..., w_(j,N)^l)

The product 123.j, W_j^l A^(l-1), can produce an N×N diagonal matrix whose diagonal elements are element-wise products of the diagonal elements of each of W_j^l and A^(l-1). The weighting 123.j can include a Hadamard product. The Hadamard product (also known as the Schur product or the entrywise product) is a binary operation that takes two matrices of the same dimensions and produces another matrix of the same dimension as the operands, where each element i,j is the product of elements i,j of the original two matrices. For two matrices X and Y of the same dimension m×n, the Hadamard product X∘Y is a matrix of the same dimension as the operands, with elements given by

(X∘Y)_(i,j) = X_(i,j) · Y_(i,j)

The Hadamard product is associative and distributive. Unlike the matrix product, the Hadamard product is commutative. Thus, in some aspects, matrix forms that commute, and operations thereon, can be configured to provide a result that is analogous to the Hadamard product of two vectors. For example, diagonal matrices or Toeplitz matrices may be employed. Disclosed aspects that exploit this and other features can provide advantageous solutions for training, operating, analyzing, or updating ANNs. Such aspects can improve the functioning of a computer processor and related technological processes disclosed herein. Furthermore, data structures disclosed herein can improve the way a computer processor stores and retrieves data in memory for ANN operations. Some benefits of the disclosed aspects include faster processing time, improved flexibility for updating signal features, and improvements to how a computer stores and reads data from memory to perform ANN operations.
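By way of example, but without limitation, the following Python (NumPy) sketch verifies that representing vectors as diagonal expansion matrices makes their matrix product equal to the diagonal form of the Hadamard (element-wise) product, and that the factors commute; the dimensions and values are illustrative.

import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=5)        # input vector a^(l-1)
w = rng.normal(size=5)        # a node's weight vector

A = np.diag(a)                # diagonal expansion of the input vector
W = np.diag(w)                # diagonal expansion of the weight vector

# The product of the diagonal expansions equals the diagonal expansion of the
# Hadamard (element-wise) product, and the diagonal factors commute.
assert np.allclose(W @ A, np.diag(w * a))
assert np.allclose(W @ A, A @ W)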

[0045] At each (j-th) node, an element-wise activation 125.j of the W_j^l A^(l-1) values can be performed to produce an output expanded activation matrix A_j^l. In one aspect, activation 125.j can comprise summing all of the non-zero elements in W_j^l A^(l-1) to produce a sum z_j^l (which is a scalar), which may be operated upon by a non-linear activation function f(·). In some aspects of the disclosure, the function f can be approximated via less-complex computations, such as a linear approximation. The argument of f may further include an added bias b_j^l. The output of the non-linear activation function can include the scalar activation value a_j^l. In aspects of the disclosure, the non-linear activation can be computed as a function of z_j^l and then performed element-wise on the W_j^l A^(l-1) values to produce the output expanded activation matrix A_j^l, from which the value a_j^l might be computed by summing the non-zero elements in A_j^l. In some aspects, element-wise activation 125.j can compute an activation scaling term g_j^l, where a_j^l = g_j^l·z_j^l. An N×N diagonal matrix G_j^l whose diagonal elements equal the scaling term g_j^l may be implemented, functionally or explicitly, for scaling the W_j^l A^(l-1) values. The expanded activation matrix can be expressed as the following diagonal matrix:

A_j^l = g_j^l · W_j^l A^(l-1)

Note that the matrix output at the j-th node of level l is a product of diagonal matrices:

A_j^l = G_j^l W_j^l A^(l-1)

Since diagonal matrices commute for multiplication, some disclosed aspects can exploit this property to update a previous numerical computation of A_j^l with an update or updated component of A_j^l. If g_j^l is a linear approximation of the activation function, an update to g_j^l can be configured to capture a non-linear aspect of the activation function, and this update can be implemented as a linear update to the numerical computation of A_j^l, such that a full numerical recomputation of A_j^l does not need to be performed to capture the nonlinear quality of the activation function. This can comprise an additive or multiplicative update, or a combination thereof. Some aspects provide for a linear update to an expanded matrix downstream (i.e., in the direction of propagation) from level l.

[0046] The matrix that is input at each node at layer l+1 is the expanded matrix A^l comprising the activation outputs of the N_l nodes in layer l:

A^l = diag(a_1^l, a_2^l, ..., a_(N_l)^l)

Weighting and element-wise activation may be employed for each layer between l and L. At layer L, an N_(L-1)×N_(L-1) diagonal matrix A^(L-1) comprising activation outputs (each of those outputs possibly being diagonal matrices) from each of the N_(L-1) nodes in layer L-1 may be provisioned for one or more nodes at layer L. Weighting 123.L using an expanded weight matrix (or functional equivalent) may be performed, followed by an element-wise activation of expanded arguments 125.L. The activation output of each node in layer L may be a row comprising the elements produced by the element-wise activation 125.L. For example, row j of A^L can be expressed as a product of the layer-L scaling and weight terms with the expanded activations of the preceding layers, wherein the layer-L terms can each comprise a vector of their diagonal elements, and similarly each activation back to the l-th layer can comprise a vector of their respective diagonal terms.

[0047] The terms in A^L may be arranged with respect to a set of parameters to produce a set of elements, and each element in A^L is computed 140 numerically to produce an initial computed matrix A^L(0). For example, the terms in each row of A^L can be arranged or grouped such that the first element in each row (i.e., column 0 of A^L) is a product of a scaling term with a first ANN parameter p_0, the second element in each row (i.e., column 1 of A^L) is a product of a scaling term with a second ANN parameter p_1, and so on. This arranging of terms in A^L by parameter p produces an A^L matrix defined by a particular parameter set. Different A^L matrices may be created with respect to different sets of parameters, and their elements computed numerically 140. In one aspect, an A^L matrix can be created with respect to the inputs a^(l-1). In another aspect, an A^L matrix can be created with respect to a set of the weights in layer l. For example, each of a set of A^L matrices may be created with respect to a common weight index value i or j. In another aspect, an A^L matrix can be created with respect to the z-value at each node in layer l. In another aspect, an A^L matrix can be created with respect to the activation scaling terms in layer l. In another aspect, an A^L matrix can be created with respect to the product of each activation scaling term with its corresponding z-value in layer l. In other aspects, the A^L matrix elements are produced by grouping parameters from other layers (e.g., layer λ ≠ l).

[0048] In the update 150, values in one or more columns of A^L(0) can be updated with respect to an update to a parameter that corresponds to the one or more columns. Thus, a partial update to A^L(0) can be effected. The update may be a multiplicative update, an additive update, or both. The update can comprise multiplication of at least one column by an update scaling term. In one aspect, the product of the parameter with the update scaling term equals the updated parameter, so the update 150 can be a multiplicative update. For example, the updated A^L(u) can equal A^L(0) after its one or more columns is updated according to the update scaling term. In another aspect, the product of the parameter with the update scaling term equals the update to the parameter, so the update 150 can comprise an additive update to A^L(0). For example, an update ΔA^L can equal A^L(0) after its one or more columns is updated according to the update scaling term, and then ΔA^L can be added to A^L(0) to produce A^L(u).
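By way of example, but without limitation, the following Python (NumPy) sketch applies a multiplicative update to a single column of a numerically computed expanded matrix and forms the equivalent additive update ΔA^L; the column grouping and scaling value are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
A_L0 = rng.normal(size=(3, 4))   # A^L(0): numerically computed base expanded matrix
col = 1                          # column grouped with the parameter being updated
scale = 1.5                      # update scaling term, e.g., p(u) = scale * p(0)

# Multiplicative update: scale only the affected column.
A_Lu = A_L0.copy()
A_Lu[:, col] *= scale

# Equivalent additive update: a matrix that is zero except for the scaled column.
dA_L = np.zeros_like(A_L0)
dA_L[:, col] = (scale - 1.0) * A_L0[:, col]
assert np.allclose(A_L0 + dA_L, A_Lu)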

[0049] It should be appreciated that one or more supplemental and/or alternative operations can be implemented with respect to the weights and activation functions, these operations including invertible transforms (e.g., FFTs, wavelet transforms, z-transforms), convolutions, filters, encoding, precoding, eigen-decomposition, singular value decomposition, and others. The methods and apparatus aspects disclosed herein with respect to mathematical operations and matrix (any matrix, including vectors and tensors) structures can be implemented functionally so as to effect the disclosed operations and structures. Some aspects can employ FFTs in a CNN, as convolution can be efficiently computed in the Fourier domain, where it becomes elementwise multiplication. Such implementations may not explicitly comprise such structures. For example, expanded matrices, diagonal matrices, and operations thereon may be effected via various data structures and algorithms in computer code, data storage schemes in memory, circuit designs, processor architectures, etc.

An initial output A^L(0) of layer L can be computed based on an initial input A^(l-1)(0):

A^L(0) = G^L W^L ... G^l W^l A^(l-1)(0)

The operation G^L can be expressed as a function of the previously computed output A^L(0):

G^L = A^L(0) [A^(l-1)(0)]^-1 [W^l]^-1 [G^l]^-1 ... [W^L]^-1

An updated output A^L(u) can be expressed as:

A^L(u) = G^L W^L ... G^l W^l A^(l-1)(u)

Substituting G^L expressed as a function of the previously computed output A^L(0) yields:

A^L(u) = A^L(0) [A^(l-1)(0)]^-1 [W^l]^-1 [G^l]^-1 ... [W^L]^-1 W^L ... G^l W^l A^(l-1)(u)

The terms [W^L]^-1 W^L reduce to the Identity Matrix I, which can be removed. Subsequent matrices paired with their inverses drop out, so the above expression simplifies to:

A^L(u) = A^L(0) [A^(l-1)(0)]^-1 A^(l-1)(u)

which expresses a multiplicative scaling of A^L(0) based on A^(l-1)(u) to produce the updated matrix A^L(u). Alternatively, an additive update ΔA^L to A^L(0) can be expressed as (and implemented by):

ΔA^L = A^L(0) [A^(l-1)(0)]^-1 ΔA^(l-1)

where ΔA^(l-1) is the update to A^(l-1)(0): A^(l-1)(u) = A^(l-1)(0) + ΔA^(l-1). In some aspects, scaling terms disclosed herein (such as A^L(0) [A^(l-1)(0)]^-1) can be configured to simplify computations. In one aspect, for example, the matrix to be inverted in the scaling term can be selected to be invertible (e.g., nonsingular, or non-degenerate).
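By way of example, but without limitation, the following Python (NumPy) sketch numerically checks the multiplicative-update identity A^L(u) = A^L(0) [A^(l-1)(0)]^-1 A^(l-1)(u) for a chain of nonsingular diagonal factors; the chain length and values are illustrative.

import numpy as np

rng = np.random.default_rng(3)
N = 4
def diag():
    return np.diag(rng.uniform(0.5, 1.5, size=N))   # nonsingular diagonal factor

G_L, W_L, G_l, W_l = diag(), diag(), diag(), diag()
A_prev0, A_prevu = diag(), diag()                   # A^(l-1)(0) and A^(l-1)(u)

A_L0 = G_L @ W_L @ G_l @ W_l @ A_prev0              # initial output A^L(0)
A_Lu = G_L @ W_L @ G_l @ W_l @ A_prevu              # fully recomputed A^L(u)

# Multiplicative update from the stored A^L(0), without re-running the chain.
A_Lu_fast = A_L0 @ np.linalg.inv(A_prev0) @ A_prevu
assert np.allclose(A_Lu, A_Lu_fast)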

[0050] In one aspect, A^(l-1)(u) = A^(l-1)(0) A_u or ΔA^(l-1) = A^(l-1)(0) A_u, where A_u is a selected multiplicative matrix update to A^(l-1)(0). Thus, A^L(u) = A^L(0) A_u; or ΔA^L = A^L(0) A_u, and A^L(u) = A^L(0)(I + A_u). Such aspects may be used in any of the examples disclosed herein.

[0051] An initial output A^L(0) of layer L can be computed based on an initial weight matrix W^l(0):

A^L(0) = G^L W^L ... G^l W^l(0) A^(l-1)

The operation G^L can be expressed as a function of the previously computed output A^L(0):

G^L = A^L(0) [A^(l-1)]^-1 [W^l(0)]^-1 [G^l]^-1 ... [W^L]^-1

An updated output A^L(u) can be expressed as:

A^L(u) = G^L W^L ... G^l W^l(u) A^(l-1)

Substituting G^L expressed as a function of the previously computed output A^L(0) yields:

A^L(u) = A^L(0) [A^(l-1)]^-1 [W^l(0)]^-1 [G^l]^-1 ... [W^L]^-1 W^L ... G^l W^l(u) A^(l-1)

which can simplify to:

A^L(u) = A^L(0) [A^(l-1)]^-1 [W^l(0)]^-1 W^l(u) A^(l-1)

which can be used to implement a multiplicative update. The disclosed examples herein may be adapted to provide additive updates. The initial weight matrix W^l(0) might be selected to simplify the computation of its inverse. In one aspect, if W^l(u) = W^l(0) W_u, where W_u is a multiplicative update matrix, then

A^L(u) = A^L(0) [A^(l-1)]^-1 W_u A^(l-1)

W_u can be configured to simplify the above expression. If the weight updates are constrained to W_u = A^(l-1) Ω_u, where Ω_u is some update matrix, then

A^L(u) = A^L(0) Ω_u A^(l-1)

which can reduce the complexity of the update computations.

[0052] In some aspects, two or more terms in the matrix expansion can be configured to form a commutative algebra. By way of example, if W_u and A^(l-1), or W_u and [A^(l-1)]^-1, form a commutative algebra, then

A^L(u) = A^L(0) [A^(l-1)]^-1 W_u A^(l-1) = A^L(0) W_u [A^(l-1)]^-1 A^(l-1) = A^L(0) W_u

Thus, a commutative-algebra constraint can simplify update computations. In one aspect, W_u and A^(l-1), or W_u and [A^(l-1)]^-1, can be circulant matrices.
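By way of example, but without limitation, the following Python sketch checks that circulant matrices commute under multiplication, which is the property the commutative-algebra constraint above relies on; the use of scipy.linalg.circulant and the matrix values are illustrative assumptions.

import numpy as np
from scipy.linalg import circulant

W_u = circulant([1.0, 2.0, 3.0, 4.0])       # candidate weight-update matrix
A_prev = circulant([0.5, -1.0, 0.25, 2.0])  # expanded input matrix A^(l-1)

# Circulant matrices share the DFT eigenbasis, so they commute under
# multiplication, which lets [A^(l-1)]^-1 W_u A^(l-1) collapse to W_u.
assert np.allclose(W_u @ A_prev, A_prev @ W_u)
assert np.allclose(np.linalg.inv(A_prev) @ W_u @ A_prev, W_u)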

An initial output A^L(0) of layer L can be computed based on an initial activation scaling matrix G^l(0) and argument Z^l(0):

A^L(0) = G^L W^L ... G^l(0) Z^l(0)

The operation G^L can be expressed as a function of the previously computed output A^L(0):

G^L = A^L(0) [Z^l(0)]^-1 [G^l(0)]^-1 ... [W^L]^-1

Substituting G^L expressed as a function of the previously computed output A^L(0) into the updated output A^L(u) = G^L W^L ... G^l(0) Z^l(u) yields:

A^L(u) = A^L(0) [Z^l(0)]^-1 [G^l(0)]^-1 ... [W^L]^-1 W^L ... G^l(0) Z^l(u)

which can simplify to:

A^L(u) = A^L(0) [Z^l(0)]^-1 Z^l(u)

In one aspect, Z^l(u) is a scaled version of Z^l(0): Z^l(u) = Z^l(0) ζ_u, where ζ_u can be a diagonal scaling matrix; thus a multiplicative update can be performed: A^L(u) = A^L(0) ζ_u. In another aspect, an additive update can be performed as follows:

ΔA^L = A^L(0) [Z^l(0)]^-1 ΔZ^l, and A^L(u) = A^L(0) + ΔA^L

where Z^l(u) = Z^l(0) + ΔZ^l.

[0053] In another aspect, G^l(0) can be updated as follows. Substituting G^L expressed as a function of the previously computed output A^L(0) into A^L(u) = G^L W^L ... G^l(u) Z^l(0) yields:

A^L(u) = A^L(0) [Z^l(0)]^-1 [G^l(0)]^-1 ... [W^L]^-1 W^L ... G^l(u) Z^l(0)

which can simplify to:

A^L(u) = A^L(0) [Z^l(0)]^-1 [G^l(0)]^-1 G^l(u) Z^l(0)

which simplifies further if G^l(0), G^l(u), and Z^l(0) are diagonal matrices (and thus commute under multiplication):

A^L(u) = A^L(0) [G^l(0)]^-1 G^l(u)

In one aspect, G^l(u) = G^l(0) γ_u, where γ_u is a diagonal scaling matrix, so A^L(u) = A^L(0) γ_u provides a multiplicative update. In another aspect, an additive update is performed: ΔA^L = A^L(0) [G^l(0)]^-1 ΔG^l, where G^l(u) = G^l(0) + ΔG^l, and A^L(u) = A^L(0) + ΔA^L.

[0054] In some aspects, both G^l(0) and Z^l(0) are updated, wherein the update to Z^l(0) can necessitate the update to G^l(0). In one aspect, the updated Z^l(u) = Z^l(0) + Z_1 + Z_2, where Z_1 is the distance from Z^l(0) to a Z-breakpoint of a PWL approximation of the nonlinear activation function where G^l(0) becomes G^l(u), and Z_2 = ΔZ^l - Z_1. Thus, one approximation of an additive update might sum a first partial update computed with G^l(0) over the portion Z_1 and a second partial update computed with G^l(u) over the portion Z_2. Another approximation might compute a scaling factor G^l(u') for the entire amount of ΔZ^l that approximates the combined effect of G^l(0) employed up to the break point (0 to Z_1) and G^l(u) employed after the break point is crossed (Z_1 to ΔZ^l).

[0055] An initial output A^L(0) of layer L can be computed based on an initial activation scaling matrix G^l(0):

A^L(0) = G^L W^L ... G^l(0) W^l A^(l-1)

The operation G^L can be expressed as a function of the previously computed output A^L(0):

G^L = A^L(0) [A^(l-1)]^-1 [W^l]^-1 [G^l(0)]^-1 ... [W^L]^-1

Substituting G^L expressed as a function of the previously computed output A^L(0) yields:

A^L(u) = A^L(0) [A^(l-1)]^-1 [W^l]^-1 [G^l(0)]^-1 ... [W^L]^-1 W^L ... G^l(u) W^l A^(l-1)

which can simplify to:

A^L(u) = A^L(0) [A^(l-1)]^-1 [W^l]^-1 [G^l(0)]^-1 G^l(u) W^l A^(l-1)

In some aspects, the above expression can simplify to:

A^L(u) = A^L(0) [G^l(0)]^-1 G^l(u)

such as if [G^l(0)]^-1 G^l(u) is a multiple of the Identity matrix, or if W^l is configured to be diagonal.

[0056] In some aspects, operators disclosed herein can comprise an interpolating function, such as an interpolation filter. In some aspects, an operator can employ a Vandermonde matrix. In any of the disclosed aspects, an MN×MN expanded matrix might be computed from an N×N matrix. In some aspects, an operator might comprise an MN-point transform (where M is an integer > 1) configured to operate on an MN×MN matrix constructed, for example, by performing "zero stuffing" or "zero padding" of its input. One aspect might employ an expanded operator, such as an MN-point interpolation filter that operates on an MN×MN zero-stuffed operand matrix to produce an MN×MN expanded matrix.

[0057] FIG. 2B is a block diagram that depicts a method in which an additive update to a linear ANN model can be configured. An updated i-th node's input z_i^l can be processed by an activation linearizer 211, which compares the input to one or more thresholds, such as breakpoint values for linear segments in the linear approximation of the activation function, and outputs one or more corresponding activation scaling terms to an update 213 of a layer output in expanded-matrix form. The ANN output at layer L (A^L) or some other layer (A^l) can be provisioned in an expanded-matrix form to enable additive and/or multiplicative updates. In one example, an expanded-matrix form of A^L comprises elements that are factors of z_i^l (e.g., one or more of the inputs at layer l), and is denoted by A^L[z_i^l]. In some aspects, A^L might comprise elements that are factors of g_i^l·z_i^l, denoted by A^L[g_i^l·z_i^l]. The elements of A^L may be computed numerically. By way of example, an expanded-matrix update (e.g., ΔA^L) can be computed 213 by scaling the elements of A^L by the update values. The update value Δz_i^l may be partitioned across multiple linear segments of the approximation.

For example, for two partitions, Δz_i^l = Δz_1 + Δz_2, where the first partition corresponds to a first segment (and scaling term g_1) and the second partition corresponds to a second segment (and scaling term g_2), the expanded matrix update ΔA^L might comprise a sum of a first update computed from g_1·Δz_1 and a second update computed from g_2·Δz_2. Each expanded matrix update can be converted 215 to an update vector Δa^L by summing the terms in each row. The update vector(s) Δa^L can be additive vectors, which may be added 217 together to produce an output vector a^L, or may be added 217 to a previous output vector a^L(0) to produce an updated output vector a^L(u). In multiplicative-update aspects, the g_1 and g_2 values are scaling terms that operate on A^L to produce at least one updated A^L, which may be converted to a^L to produce an updated output vector a^L(u).

[0058] FIG. 2C is a block diagram that depicts an additive update to a linear ANN model configured for an output computation 219, such as a gradient computation and/or a computation of a change in error, loss, or cost. It should be appreciated that any of various types of linear approximation 211 can be used in the disclosed aspects.

[0059] In the disclosed aspects, piecewise linear (PWL) approximation can provide low maximum and average error with low computational complexity. An approximation might employ the A-law companding technique. In some aspects, the linear approximation is configured such that the gradient of each linear segment is expressed as a power of two. This enables replacing multipliers with shifters. PWL activation functions have zero curvature (i.e., constant first-order derivative) inside each interval divided by its break points.

[0060] In some aspects, such as depicted in FIGs. 3A and 3B, a PWL approximation may comprise dividing the nonlinear activation function into segments 301, which can include exploiting symmetry (e.g., in sigmoid, tanh, etc.) to reduce the number of computations. An update to the activation function's argument can be compared to the linear approximation's breakpoints 302. If the updated argument does not transition any of the breakpoints, an update to the ANN model is computed as a linear update to the expanded matrix A^L 303. This update 303 may be any combination of additive and multiplicative updates, and can comprise multiplying corresponding column(s) of an expanded matrix (e.g., A^L and/or ΔA^L) with one or more scaling terms. Rows of the update matrix ΔA^L and/or updated matrix A^L can be summed 304 to produce an update Δa^L or updated a^L. One or more of the ANN's updated output parameters (and/or changes to those output parameters) can be computed 305 from the update ΔA^L or updated A^L, or the update Δa^L or updated a^L. The ANN output parameters can include expectations (y), error, loss, cost, and/or the gradients thereof.

[0061] In some aspects, the comparison 302 may indicate that an updated argument transitions one or more of the breakpoints. The change(s) in the linear model due to such transitions can comprise one or more linear updates 313 to the ANN model, such as by performing a linear multiplicative and/or additive update to the expanded matrix A^L or ΔA^L.

Rows of the update and/or updated matrix can be summed 304, and one or more of the ANN’s updated output parameters (and/or changes to those output parameters) can be computed 305. One aspect provides for linearizing 301, at finite quantization intervals, the nonlinear activation functions to provide a PWL approximation. An update to the argument(s) of an activation function is evaluated 302 with respect to the breakpoints of the PWL approximation (which define each quantization interval in the PWL approximation).

For an argument update that spans multiple quantization intervals, an additive update 313 to A^L (or ΔA^L) can be performed as a sum of partial updates:

A^L(u) = A^L(0) + Σ_(k=1)^K ΔA^L(g_k, Δz_k)

where A^L(0) denotes an initial or previous update of the expanded matrix A^L, k is an index that spans the set of K quantization intervals, Δz_k is the extent of the argument update in the k-th quantization interval, g_k is the activation scaling term of the k-th quantization interval, and ΔA^L(g_k, Δz_k) denotes the initial or previous expanded matrix A^L with corresponding ones of its columns comprising update terms g_k and Δz_k.
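By way of example, but without limitation, the following Python (NumPy) sketch forms an additive update as a sum of partial, single-column updates, one per crossed quantization interval, under the assumption that the selected column holds numerically computed multipliers of the updated node's activation term; the interval extents and scaling terms are illustrative.

import numpy as np

rng = np.random.default_rng(4)
A_L0 = rng.normal(size=(3, 4))      # base expanded matrix A^L(0)
col = 2                             # column grouped with one node's activation term
multipliers = rng.normal(size=3)    # numerically computed multipliers for that column

dz = [0.4, 0.1]                     # extent of the argument update in each crossed interval
g = [1.0, 0.5]                      # activation scaling term of each interval

dA_L = np.zeros_like(A_L0)
for g_k, dz_k in zip(g, dz):
    dA_L[:, col] += multipliers * (g_k * dz_k)   # partial, single-column update

A_Lu = A_L0 + dA_L                  # updated expanded matrix; row sums give a^L(u)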

[0062] The process of updating A^L(0) with g_k and Δz_k can comprise a sparse-matrix update, as only a subset of its columns are updated. ΔA^L can comprise a sparse matrix operation on A^L, wherein only a subset of the columns in A^L are provisioned as the one or more columns in ΔA^L. Updates to A^L or ΔA^L with respect to any of the other ANN parameters can comprise sparse-matrix updates. Sparse-matrix updates to an expanded matrix can take the form of multiplying the expanded matrix with a sparse diagonal matrix (i.e., wherein at least one of the diagonal elements is zero). However, such operations can be effected by bit-shifting, sign changes, and/or zeroing elements, such as disclosed herein. The hardware and/or software disclosed herein can optimize expanded-matrix processing operations and/or partial updates, and can include a variety of optimization solutions for sparse processing. A GPU architecture can be adapted for optimizing global memory access, optimizing shared memory access, and exploiting reuse and parallelism. Optimizing sparse processing operations can include characterizing memory access cost, access pattern, and type and level of memory, and exploiting data locality. Exploiting reuse can include caching each element in on-chip memories, and exploiting parallelism can include employing synchronization-free parallelism. Aspects disclosed herein can provide for optimizing dense and/or sparse operations (including sparse matrix-matrix multiplication, sparse transforms, and other operations that involve or are based upon matrix expansion and/or expanded linear transform operations) on graphics processing units (GPUs) using model-driven compile-time and run-time strategies.

[0063] Some aspects can exploit mixed-precision ANN parameters, such as depicted in FIGs. 3C and 3D. In one aspect, some values within the ANN can be high precision, whereas at least some parameters or parameter updates can be configured 322 to be low precision. An initial computation of A^L, ΔA^L, a^L, and/or Δa^L, for example, might comprise high-precision operations 321, and updates 323 can be provisioned with low-precision values (from 322) in a process that simplifies update operations 323 that are performed on the initial computation (e.g., A^L(0) and/or ΔA^L(0)). The rows of the updated expanded matrix (e.g., A^L(u) and/or ΔA^L(u)) may be summed 324. By way of example, but without limitation, updates to some parameters (e.g., weights and/or activation scaling terms) can be constrained to a simplifying update constraint, such as a set of low-precision values that are powers of two 332 (or some other simplifying update constraint), such that an update to the initial (or a previous) computation of A^L, ΔA^L, a^L, and/or Δa^L can be implemented with low-complexity operations, such as bit shifting 333. Ternary values (-1, 0, 1) are another possible simplifying update constraint that can be implemented in 322. By way of example, a low-complexity update 323 to elements in A^L(0) and/or ΔA^L(0) might comprise performing sign changes, zeroing out (e.g., discarding) terms, or combinations thereof. In some aspects, activation functions, such as ReLU, provide for a simple update mechanism by employing a slope of one when activated, and zero (or close to zero) when deactivated. Thus, the non-linearity of the ReLU can be implemented by selecting or deselecting terms in an expanded matrix, such as (but not limited to) A^L, ΔA^L, or Z.
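By way of example, but without limitation, the following Python (NumPy) sketch constrains per-column update scaling terms to the ternary set (-1, 0, 1), so that applying the update to the high-precision base matrix reduces to sign changes and zeroing; the quantization threshold is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(5)
A_L0 = rng.normal(size=(3, 4))               # high-precision base expanded matrix
raw = rng.normal(scale=0.7, size=4)          # proposed per-column update scaling terms

# Constrain each update term to the ternary set {-1, 0, +1}: small terms are
# dropped (zeroed) and the remaining terms keep only their sign.
ternary = np.where(np.abs(raw) < 0.5, 0.0, np.sign(raw))

# Applying the constrained update then needs only sign changes and zeroing.
dA_L = A_L0 * ternary[np.newaxis, :]
A_Lu = A_L0 + dA_L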

[0064] In one example, a sigmoid function y = f(x) can be implemented as follows. Since the sigmoid function has a symmetry point at (0, 0.5), only half of the x-y pairs need to be explicitly computed (e.g., y(x) = 1 - y(-x)). Thus, the following computations need only be performed on the absolute value of x:

In this aspect, the slopes and intercepts are powers of two, which can enable the ANN to use shift and addition operations to perform updates. The expanded matrix update 213 can advantageously be implemented via shifting the values in A^L, such as due to updates in the z values, since the slopes are powers of two. In some aspects, a bias term (added to the argument z) can be configured to change the activation function's output (e.g., y) by a factor that is a power of two or by an additive amount that is a power of two, such as to enable the expanded matrix update 213 to be implemented via bit-shifting operations. Similarly, z updates can be configured to be implemented in the expanded matrix update 213 via shifting operations. For example, Δz updates can be constrained to powers of two. Configuring the updates to be constrained to powers of two can be achieved, for example, by implementing methods that involve constraining the constituent values of z or their updates to powers of two. For example, an update Δw to a weight value can result in the update Δz = Δw·a. Thus, one aspect might configure weights W and the node outputs a to be powers of two. Another aspect might adapt the weights to cause Δz to be a power of two. Some aspects can configure the change in the node's activation (e.g., Δa, or a sum of changes in the activation) to be a power of two so the update to A^L can be performed by shifting the corresponding numerical values in A^L.
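By way of example, but without limitation, the following Python sketch shows a piecewise linear sigmoid in which the segment slopes are powers of two and the symmetry about (0, 0.5) halves the stored segments; the specific breakpoints, slopes, and intercepts are illustrative assumptions and are not the particular segment values of the disclosure.

import numpy as np

def pwl_sigmoid(x):
    # Illustrative piecewise linear sigmoid; the slopes (2**-2, 2**-3, 2**-5)
    # are powers of two so segment evaluation can use shifts and additions.
    ax = abs(x)
    if ax >= 5.0:
        y = 1.0
    elif ax >= 2.375:
        y = 0.03125 * ax + 0.84375    # slope 2**-5
    elif ax >= 1.0:
        y = 0.125 * ax + 0.625        # slope 2**-3
    else:
        y = 0.25 * ax + 0.5           # slope 2**-2
    return y if x >= 0 else 1.0 - y   # symmetry about (0, 0.5)

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, pwl_sigmoid(x), 1.0 / (1.0 + np.exp(-x)))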

[0065] One approach provides linearizing, at finite quantization intervals, the nonlinear functions used to model neuron activations via piecewise linear approximation. These solutions provide parameters corresponding to each quantization interval, such that the implementation of a different neuron model for certain aspects of the present disclosure may involve a simple substitution of the parameters. Further solutions provide for linear updates when a function's argument is updated to some value beyond a finite quantization interval, such that the change in the function's linear approximation is itself modeled as a linear approximation. For example, when an update to the argument causes the argument to transition from one quantization interval to another quantization interval (e.g., across a breakpoint), wherein each quantization interval corresponds to a different piece of the piecewise linear approximation, the update to the activation can be modeled as a linear update. Such updates can be configured to update corresponding terms in an expanded-matrix model.

[0066] Some examples of activation approximations include the Taylor expansion method, the average slope method, linear extrapolation, linear prediction, the first order linear interpolation method, and the optimal linear interpolation method. An affine transformation may be employed.

[0067] Some aspects can use an algorithm based on the centered recursive interpolation (CRI) method, which is designed to improve the accuracy recursively. A CRI method begins with initializing three straight lines:

y 1 (x) = 0
y 2 (x) = 0.5*(1 + x/2)
y 3 (x) = 1

Since only the positive x-axis needs to be considered, the CRI algorithm is as follows:

g(x) = y 2 (x);
h(x) = y 3 (x);
for (i = 0; i < q; i++) {
    g'(x) = Min[g(x), h(x)];
    h(x) = 0.5*(g(x) + h(x) - delta);
    g(x) = g'(x);
    delta = delta/4;
}
g(x) = Min[g(x), h(x)];

where q is the interpolation level, delta is the depth parameter dependent on q, h(x) is the linear interpolation function, and g(x) is the resulting approximation. Neither multiplications nor divisions are needed, since they are reduced to shift operations. Suggested values for delta are:

0.30895 for q = 1 (5 segments)
0.28094 for q = 2 (9 segments)
0.26588 for q = 3 (17 segments)
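
The CRI pseudocode above can be evaluated pointwise; the following Python sketch assumes the sigmoid-style initialization y 2 (x) = 0.5*(1 + x/2) and the suggested delta values, and is illustrative only:

def cri_sigmoid(x, q=2, delta=0.28094):
    """Centered recursive interpolation evaluated at a single point x.
    q is the interpolation level; delta is the depth parameter suggested for that q."""
    ax = abs(x)
    g = 0.5 * (1.0 + ax / 2.0)   # y2
    h = 1.0                      # y3
    for _ in range(q):
        g_new = min(g, h)
        h = 0.5 * (g + h - delta)
        g = g_new
        delta /= 4.0
    g = min(g, h)
    return g if x >= 0 else 1.0 - g   # exploit the symmetry about (0, 0.5)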

[0068] FIG. 4A is a block diagram that depicts a linear update to an ANN model based on an update to a non-linear activation function. For a given updated argument value, its activation can be computed 401 to produce an updated activation. In some aspects, control passes to step 403, where an expanded matrix update to A L can be performed 403. The update 403 can be performed in various ways. In one aspect, the updated activation value can be substituted in place of its initial or previous activation value in an expression for the initial or previous expanded matrix A L (0) to produce an updated expanded matrix A L (u). The rows of A L (u) can be summed 404 to produce an updated a L , (a L (u)).

[0069] In some aspects, the update 403 can employ numerically computed elements of an expanded scaling matrix A L [p] of a parameter "p", wherein p can be any ANN parameter. Some elements of A L (e.g., one or more columns in the first matrix on the right side of the above equation) can comprise numerically computed multipliers that multiply (e.g., scale) p. For example, at least the second column of A L can comprise products of scaling terms with p. As shown above, p can be removed from the second column of A L to produce A L [p]; thus, the scaling terms of p reside in the second column of A L [p]. In one aspect of the update 403, the numerically computed scaling terms of p in A L [p] can be multiplied by an updated parameter p(u) to produce an updated expanded matrix A L (u). For example, matrix A L [p] might be multiplied by a diagonal matrix having at least one diagonal element equal to p(u). In this example, the second diagonal element can be p(u), which corresponds to the second column of A L [p]. In another aspect of the update 403, the numerically computed scaling terms of p in A L [p] can be multiplied by an update parameter Δp(u) to produce an update expanded matrix ΔA L (u), which can be added to an initial or previous A L (0) to produce A L (u). Multiplication described herein can be simplified to bit-shifting operations, such as if p(u) or Δp(u) are constrained to powers of two. Thus, the bits in the scaling terms can be shifted in accordance with the values of p(u) or Δp(u). Updates with low computational complexity can be employed by using other parameter constraints, such as ternary values, e.g., (-1,0,1), or binary values, e.g., (0,1) or (-1,1).

[0070] In some aspects, update 403 may include summing all the scaling terms of a subject parameter p in each row of A L [p] to produce a single scaling term. Thus, update 403 may provide for a single column of scaling terms in A L [p]. The one or more other elements in each row that do not scale p may also be summed, resulting in a single column of constant terms in A L [p]. Thus, A L [p] may be a two-column matrix, comprising a scaling vector a L [p] or Δa L [p], and a vector of constants c[!p] that do not operate on p. The update 403 can provide for element-wise multiplication (such as a Hadamard product) of a vector of p(u) or Δp(u) values with the scaling vector in A L [p] to produce scaled values of p(u) or Δp(u). This multiplication may be implemented via bit-shifting the values in the scaling vector according to the values in the vector of p(u) or Δp(u) values. The scaled and constant values may be summed 404 in some aspects. An update or updated ANN output parameter may be computed 405.
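
A minimal Python sketch of the two-column form described above follows (numpy assumed; the names and values are illustrative): the scaling column is combined with an updated parameter value, here constrained to a power of two so the multiply could be realized as a shift, and the constant column is added back in:

import numpy as np

def update_rows_with_parameter(scaling_col, const_col, p_new):
    """Two-column expanded form A^L[p]: one column of numerically computed scaling
    terms of a parameter p, and one column of terms that do not operate on p.
    The updated row sums are scaling*p_new + constants (a Hadamard-style update);
    p_new may be a scalar or a per-row vector of p(u) or delta-p(u) values."""
    return scaling_col * p_new + const_col

scaling = np.array([0.7, -1.2, 0.05])   # numerically computed multipliers of p
consts  = np.array([0.3,  0.4, -0.10])  # row terms that do not operate on p
p_u = 2.0 ** -3                         # updated parameter constrained to a power of two
a_L_updated = update_rows_with_parameter(scaling, consts, p_u)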

[0071] In some aspects, a scaling term (and optionally an intercept) can be computed 412 to approximate the nonlinear activation of an update argument Δz. In one aspect, an expanded A L matrix can be updated by the scaled Δz. An additive update based on the optional offset may be provided. In some aspects, the scaling term and the offset are set to values that are powers of two, wherein the sum of scaled Δz and the offset approximates the update to the nonlinear activation.

[0072] The updated parameter p(u) may be the updated activation or the update activation (e.g., the change in activation). The value might be configured to be a power of two, such that multiplying the factors can be effected via bit-shifting operations. For example, the argument(s) of the activation function can be configured such that the resulting activation values or activation updates are powers of two. In some aspects, the update 403 can comprise quantizing each parameter value (e.g., p(u) or Δp(u)) to a power of two. For example, a value may be quantized to its nearest power-of-two value.

[0073] In some aspects, step 402 is employed, wherein at least a slope is computed 402 from the current pair and a previous pair. The slope of the activation function can be approximated as:

In one aspect, the approximated slope can serve as the parameter to be updated. Update 403 can provide for multiplying the corresponding matrix with the slope, and then summing the product with an initial or previous A L . In some aspects, the slope can be quantized to a power of two, and the multiplication can be implemented via bit-shifting the corresponding numerical values in the matrix.

[0074] In some aspects, an offset value m is computed to reduce the quantization error, and the offset value m may be constrained to be a power of two. The update 403 can comprise bit shifting the matrix A L with respect to m to produce an additive update to an initial or previous A L . Offset values may be employed to reduce quantization error with respect to any of the other ANN parameters.

[0075] It has been observed that while training an ANN, the learned parameters (e.g., weights and biases) often change by only small amounts upon each iteration, which can be exploited to provide linear approximations to the nonlinear activations, and thus, to the ANN model or portions thereof. When ReLU activation functions are employed, for example, it is often observed that nodes are in an activated or de-activated state for many iterations while learning. Thus, many iterations can employ updates to the ANN via linear approximations, at least for a portion of the ANN model. However, disclosed aspects are not limited to particular activation functions or small parameter changes, and some aspects can model non-linearities (including any non-linear activations) via linear approximations. ReLUs are advantageous in that the arguments of the activation function are passed in the activated state, and in the deactivated state, the zero might be implemented by discarding corresponding terms in the expanded matrix of interest.

[0076] With reference to FIG. 1A, at the input to Layer L, which comprises one or more (N L ) nodes, outputs from Layer (L-1) may be weighted, and a bias may be added, to produce input vector z L , such as represented by:

The activation function is then performed on z L at Layer L to produce the outputs a L . An error function (sometimes referred to as loss or cost) can be computed for each value of the output vector a L , and a total loss or cost function can be computed from the set of computed errors. Disclosed aspects can provide for computationally efficient updates to the inputs z L of Layer L, the outputs a L of Layer L, the computed errors, and/or the total cost. Note that Layer L can be a hidden layer or an output layer of the ANN.

[0077] In accordance with aspects of the disclosure, vector z L , disregarding the bias, can be expanded to a matrix form, Z L = W L A L-1 , wherein weight matrix W L operates on the expanded matrix A L-1 , whose diagonal elements are the Layer (L-1) output values. The vector z L , which may or may not include the bias, can be obtained by summing the elements in each row of Z L . In some aspects, the bias terms can be employed in the conversion of the activation functions to linear approximations.

[0078] By way of example, for each row of the matrix Z L , the terms are summed, a bias term corresponding to each row is added to the sum, and then the Layer L activation function can be performed on the sum to produce an activation, or output. In some of the aspects disclosed herein, activation functions can be implemented using a recursion function. In some of the disclosed aspects, the activation function can be implemented via scalar multipliers applied to the vector z L . Thus, a bias can be implemented to change or adapt the corresponding activation scaling term. In some aspects, the scalar multipliers may be computed using a recursion function. The scalar multipliers can be employed in an expanded-form scaling matrix G L . The output of Layer L is the expanded matrix A L , e.g., A L = G L Z L .

[0079] In some aspects, one or more columns or rows of zeroes may be added to the expanded matrix Z L to satisfy the necessary matrix dimension requirements of the multiplication of G L with Z L . Similarly, zero insertion can be effected in the examples throughout this disclosure to achieve the necessary matrix dimensions for the operations disclosed herein.
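
By way of a non-limiting illustration, the following Python sketch (using numpy; all names are illustrative) constructs the expanded matrix Z L as the weight matrix applied to a diagonal matrix of Layer L-1 activations, sums each row to recover z L , and applies per-node activation scaling terms to obtain A L , whose row sums give the Layer L output a L . A ReLU-style scaling (slope one or zero) is assumed for the activation:

import numpy as np

N = 4
W_L   = np.random.randn(N, N)        # Layer L weights
a_Lm1 = np.random.rand(N)            # Layer L-1 outputs (activations)
b_L   = np.random.randn(N)           # Layer L biases

# Expanded matrix Z^L: element (i, j) = w_ij * a_j^{L-1}, so each addend of z_i^L is kept.
Z_L = W_L * a_Lm1[np.newaxis, :]     # same as W_L @ np.diag(a_Lm1)
z_L = Z_L.sum(axis=1) + b_L          # summing each row recovers the usual pre-activation

# Piecewise-linear activation modeled as per-node scaling terms g_i (ReLU: slope 1 or 0).
g_L = (z_L > 0).astype(float)
A_L = g_L[:, np.newaxis] * Z_L       # expanded output A^L = G^L Z^L
a_L = A_L.sum(axis=1) + g_L * b_L    # row sums (with the bias folded in) give the output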

[0080] If an updated argument at node i is determined to be within a predetermined threshold, the activation of the updated argument can be approximated using a linear model, possibly with respect to a previous activation. For example, the scalar multiplier corresponding to a previous argument can be used for the linear approximation of the activation function applied to the updated argument. When the updated argument exceeds the threshold, the corresponding scalar multiplier can be updated. In some aspects, this can include recomputing the multiplier using a recursive function. The model can employ a linear update to effect the update to the activation. The model update may employ an additive update. For example, an additive update can be used in the expanded matrix format employed in the model. The model update may employ a multiplicative update. For example, a scaling update that produces an updated activation can multiply corresponding elements in rows or columns of the model's expanded-matrix form.

[0081] Changes to the biases, the weights W L , and activations from previous layers can affect the activation-function arguments, and thus, the activation functions. Disclosed aspects can provide for tracking such changes in the affected nodes of forward layers, and provide for updating the activations in those nodes as necessary. If, at a given node, the update from a previous layer is scaled by a small amount (e.g., a weight that is close to zero, or below a threshold value), the evaluation of the update's effect on the activation can be skipped. The activation function and/or the activation scaling weights can be updated using recursive relationships, such as recursion functions. It should be appreciated that additive or multiplicative updates may be made to parameters described herein.

[0082] In some aspects, at each node where the argument of the activation function changes due to an update to one of the model's parameters, the change in the argument can be computed and compared to at least one threshold value in order to determine if the linear approximation of the activation function needs to be updated. If necessary, the update to the activation scaling weight can be made, followed by a corresponding linear update to the network model.

[0083] In one example, an update to an activation scaling weight is implemented as a multiplicative (e.g., scaling) update or an additive update to row "0" of column vector a L or expanded matrix A L . Alternatively, the update to an error, loss, or cost can be computed. In some aspects, the error, loss, or cost is implemented functionally using an expanded-matrix format. In some aspects, this may be computed as an update to a previously computed error corresponding to row "0". In one example, an update to a weight updates a particular element (e.g., (0,0)) in A L . This can be implemented as a scaling update or an additive update to row "0" of column vector a L or element (0,0) of expanded matrix A L . In some aspects, this may be computed as an update to a previously computed error corresponding to row "0". In one aspect, an update of a previous layer's (e.g., Layer L-1) activation updates a particular column (e.g., column 0) in A L . This can be implemented as a column update to A L or an additive update to vector a L . In some aspects, this may be computed as an update to each previously computed error.

[0084] In some aspects, a node's activation function may be turned off, effectively eliminating the node, which zeros the node's outputs. In some aspects, this models removal of the node. In other aspects, this can model the zero-output state of a ReLU activation. Thus, the effect of zeroing the node may be computed as one or more linear updates to the model by removing or ignoring corresponding values in the model's expanded matrix. In some aspects, a change to the input data to the ANN can be modeled as a linear update.

[0085] In an exemplary aspect, an update to a Layer (L-1) activation can be represented as an update expanded matrix ΔA L , which can be implemented as an additive update to an (initial or previous) A L (0) to produce an updated expanded matrix A L (u). Alternatively, the non-zero elements in each row of ΔA L can be summed to produce an update column vector Δa L , which can be added to an (initial or previous) column vector a L (0) to provide an updated column vector a L (u). In some aspects, the column vector Δa L can be used to compute gradients. Updates to the network's parameters can be implemented as simple bit operations applied to values (represented here in expanded-matrix format) stored in memory from one or more previous iterations.
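
Continuing the earlier sketch (same illustrative variables; numpy assumed), a change in a single Layer L-1 activation affects only the corresponding column of the expanded matrix, so the update can be expressed as a one-column additive update whose row sums give Δa L ; this is a hedged sketch, assuming no activation breakpoint is crossed:

import numpy as np

def column_update(A_L, W_L, g_L, j, delta_a_j):
    """Additive update Delta A^L when the j-th Layer L-1 activation changes by delta_a_j,
    assuming the activation scaling terms g_L are unchanged (no breakpoint crossed)."""
    dA = np.zeros_like(A_L)
    dA[:, j] = g_L * W_L[:, j] * delta_a_j   # only column j of the expanded matrix changes
    return A_L + dA, dA.sum(axis=1)          # updated A^L(u) and the update vector delta a^L

# Usage (with A_L, W_L, g_L as computed in the earlier sketch):
# A_L_u, d_a_L = column_update(A_L, W_L, g_L, j=0, delta_a_j=0.01)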

[0086] Such techniques can be adapted to provide linear updates to Z L , for example. Network updates due to updated weights and/or biases may be provided (and adapted as necessary) in a similar manner as described herein. It should be appreciated that such techniques can be adapted to updates that were made to parameters in earlier (i.e., previous) layers, and multiple layers of the ANN (and their updates) can be modeled via one or more linear approximations.

[0087] In one aspect, the ΔZ L resulting from such an update is computed:

The updated z L values at each node in Layer L can then be compared to threshold values or breakpoints to determine if the argument of each activation function has changed sufficiently to require an update to the corresponding activation scaling weight. The model update can be made with respect to the Layer L activation, error, loss, or cost.

[0088] FIG. 5 illustrates aspects in which expanded matrices can be used in combination with linear updates in an ANN. One or more expanded matrices can be computed 500 for Layer L based on parameters at Layer L-3. An update 501 to one or more parameters in Layer L-3 may be made. For example, parameter updates may be based on gradient computations (e.g., 508). The update 501 may comprise updates to one or more activation scaling factors in Layer L-3. The expanded matrix (or matrices) can be updated 502 based on the parameter updates 501.

[0089] In Layer L-2 (and subsequent layers), the updates to arguments z are evaluated at each node as part of a scaling-factor update process 503, 505. If the update to z is determined to be zero or negligible (e.g., below a threshold value) at a particular node, the evaluation may be skipped, and no update to that node's activation scaling factor is made. In nodes that pass the evaluation, the update to z is evaluated to determine if an update 503, 505 to each node's activation scaling factor is to be made. An update 503, 505 to an activation scaling factor can be followed by a corresponding update to the expanded matrix 504, 506, respectively. After all the updates are computed (502, 504, 506), gradients for the Layer L-3 parameters may be computed 508, which can be returned 509 to the update 501 to compute new parameter updates.
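The per-node evaluation described above can be sketched as follows (Python; the breakpoint/slope representation and the tolerance are illustrative assumptions, not taken from the disclosure):

import bisect

def maybe_update_scaling(z_old, dz, breakpoints, slopes, tol=1e-6):
    """Return (g, crossed): the activation scaling term for z_old + dz and whether a
    breakpoint was crossed, skipping the evaluation when the argument change is negligible.
    slopes has one entry per piecewise-linear segment (len(breakpoints) + 1)."""
    if abs(dz) < tol:                              # negligible update: skip the evaluation
        seg = bisect.bisect_right(breakpoints, z_old)
        return slopes[seg], False
    seg_old = bisect.bisect_right(breakpoints, z_old)
    seg_new = bisect.bisect_right(breakpoints, z_old + dz)
    return slopes[seg_new], seg_new != seg_old

# Example: ReLU modeled as a piecewise-linear function with one breakpoint at 0.
g, crossed = maybe_update_scaling(z_old=0.4, dz=-0.5, breakpoints=[0.0], slopes=[0.0, 1.0])
# crossed is True here, so the corresponding column values in the expanded matrix would be
# updated (504, 506); otherwise the previous scaling term is reused.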

[0090] The following example illustrates how the methods and apparatuses described with respect to FIG. 5 may be practiced. In this example, the number of nodes in each layer (N L ,..., N L-3 ) is assumed to be N. Also, the following index notation of the weights is used: in a weight w i,j of layer l, index i denotes the output of node i in layer l-1, and index j denotes node j in layer l.

The output of the Layer L relative to the activations of Layer L-1 is:

Instead of summing the values in the brackets (), the addends can be elements of the expanded matrix A L . Sparsity in both the activations and the weights results in sparsity in the expanded matrix, which can be exploited for computational, memory, and power savings. Further improvements can be realized by employing quantization, weight sharing, pruning, and/or constraining update values to a restricted set that enables simple processing steps to replace multiplication. It should be noted that the output of Layer L can be expressed with respect to any of the ANN parameters in preceding layers, such as argument z values, bias b values, activation a values, weights w, or g values.

[0091] The output A L of the Layer L relative to the activations of Layer L-3 is:

In the expression for the output A L of the Layer L, such as the expression shown above relative to the activations of Layer L-3, certain numerical computational efficiencies can be achieved, such as (but not limited to) exploiting similarities in the expression so that similar products and sums are not repeated, quantizing parameter values (including setting small values to zero), enabling different parameters with the same value to share the same memory location, and/or employing parameter values from a restricted set (such as powers of two, ternary values, binary values, and the like) such that products may be implemented by bit-shifting, sign changes, or skipping operations when there is a multiplication by zero, among others.

[0092] In one example, since each operation in the brackets {·} appears multiple times in each expression for the output elements, and appears across the multiple expressions, the numerical computation for a given expression in {·} might be performed once, and the result used wherever that expression appears. In a set of expressions (e.g., the expressions for the output elements, or other expressions disclosed herein), operations (e.g., in the brackets {·}) may be provisioned as pointers to a common memory location where a numerical value for the operation is stored. When one or more parameters in the operation are updated, the operation's corresponding numerical value is updated, and the expression(s) can be automatically updated. A parameter update may be provisioned from a restricted set, such as to update the numerical value via a simple operation (e.g., a bit shift, an additive update, and/or a sign change). In the case of a ReLU activation scaling function update, the operation's pointer can be directed to a zero value stored in memory, or to the memory location where a previously computed numerical value for the operation is stored. When the set of expressions is updated accordingly, any terms (e.g., elements) that are scaled by zero (e.g., pointers to the stored zero value) can be omitted from the computation of the expression.

[0093] In one example, an expression for an output element shows that {·} is multiplied by a set of multipliers, and the resulting products are summed. In one aspect, the multipliers are summed, and a product of the sum with the numerical computation of {·} can be performed. In another aspect, the multipliers are selected from a restricted set of values such that each product of a multiplier with the numerical computation of {·} can be performed by bit shifting, changing the sign, or selecting/de-selecting (e.g., disregarding) the numerical computation of {·}. For example, Layer L-1 might employ a ReLU activation function, so the corresponding multiplier is zero (which de-selects the computed value {·}) or one (which has no effect on the multiplication). A weight might be selected from a set of ternary values (-1,0,1), so a value of -1 changes the sign of the computed value {·}, zero omits {·}, and one has no effect. Layer L might employ a tanh activation configured to be piecewise linear with parameters that are restricted to powers of two; thus, the activation can be implemented by bit-shifting the computed value. Aspects disclosed herein can employ or be employed in any of the expanded matrix types described herein, including (but not limited to) A L and A L (p) matrices.

[0094] In one aspect, the expanded matrix A L comprises the addends of the output vector elements. In another aspect, the expanded matrix A L (p) comprises the scaling factors of parameter(s) p. For example, A L (p) = A L P -1 , where P is a diagonal matrix of the same dimension as A L , with diagonal elements (indexed by (j,j)) equal to p corresponding to column index j of A L where p occurs in A L 's elements; the other diagonal elements of P are one. The elements of A L or A L (p) may be computed numerically. The computed matrix A L may be employed in a multiplicative update, such as where an updated p comprises a scaling factor that multiplies a previous p, and the scaling factor is used to multiply the numerical elements of the (computed) matrix A L computed from p. In another aspect, the computed elements of A L (p) that are scaling factors of p can be multiplied by a change in (i.e., update to) p, (i.e., Δp), and the product can be added to a previously computed expanded matrix, e.g., A L = A L (p)P. It should be appreciated that the parameters p can be ANN parameters from one or more of the layers.

[0095] In some aspects, elements of A L or A L (p) can be grouped by a predetermined parameter set p, and each group of elements corresponding to a particular parameter can be summed. By way of example, the elements of A L can be grouped with respect to like parameters, and the grouped elements can be summed. In this case, the first column of A L might comprise products with a first parameter, and the (N-1) th column of A L might comprise products with an (N-1) th parameter. An A L whose elements are grouped with respect to parameters p is denoted as A L {p}. Thus, an A L whose elements are grouped with respect to the Layer L-3 activations can be an N×N matrix, denoted as the expanded A L {a L-3 } matrix of coefficients of the Layer L-3 activations.

[0096] By way of example, elements in each row of A L that are products with a 0 L-3 (shown in bold) can be summed, and the sums can provide the elements of column 0 in A L {a L-3 }. Column 1's elements can comprise sums of elements corresponding to a 1 L-3 , and so on, up to column (N-1)'s elements comprising sums of the elements corresponding to a N-1 L-3 :

[0097] For example, in the expression of a given output element, the terms that comprise a particular parameter can be summed, and this sum can be expressed via arranging and grouping terms to produce a product of sums. Similarly, in the expression of another output element, the element corresponding to that parameter can be expressed as a corresponding product of sums:

The above disclosure can be generalized for different parameters. The terms in brackets (·) may appear in other expressions in elements of the expanded matrix, and possibly in other expanded matrices, so these terms can be computed once, stored, and reused where appropriate. Computations can be simplified when at least some parameters are zero and/or small enough to be approximated as zero, such as to bypass at least some products and/or sums. In some aspects, elements of A L or A L (p) can be grouped according to multiple sets of parameters.

[0098] In some aspects, an update to a parameter in one layer can cause an update to at least one parameter in a subsequent layer in the direction of propagation. A parameter update might be made, and then the effect of the update on the argument of any activation function in the subsequent layers can be evaluated to determine if the approximated activation function needs to be updated 503, 505. The update 503, 505 to the approximated activation function can be implemented as a linear update 504, 506 to the corresponding numerically computed elements in A L or A L (p). For example, if it is determined that the parameter update causes an update to a downstream activation, the first element (column 0) in the expression of one output element can be updated with an additive update, and the first element (column 0) in the expression of another output element can be updated with a corresponding additive update:

In some aspects, the above updates may employ the updated value of the affected activation. In some aspects, such as when the update is small, the above updates may employ a previous value of the affected activation. Updates 504, 506 to multiple columns in A L or A L (p) can be made, as necessary. It should be appreciated that numerical computations for some expressions can be performed once and reused where the same expression appears in other elements of A L or A L (p). Products of values that are close to zero may be omitted for the purpose of simplifying the update computations, such that the improvement in computational efficiency outweighs the loss in accuracy. Similarly, summands that are close to zero may be omitted.

[0099] In another example, the parameter update causes an update to a different downstream activation, so the first element (column 0) in the expression of one output element can be updated with an additive update, and the first element (column 0) in the expression of another output element can be updated with a corresponding additive update:

In a ReLU, there is a single breakpoint where the argument equals zero, and the slope changes between zero and one.

In many aspects, the updates 504, 506 to downstream activations are an infrequent occurrence. In many aspects, the complexity of implementing such updates is dramatically reduced by virtue of many node activations being zero and/or close enough to zero to be disregarded. Network sparsity can be exploited whereby a number of the synapse weights w i,j are zero and/or close enough to zero to be disregarded. In some aspects, where any of the multipliers are zero (or very small) in the above expressions, the corresponding multiplications can be omitted. Thus, updates might comprise only a few sums and products, and updates might be made to only a subset of the elements. Some of the nodes may be removed, which can be effected by setting the corresponding node's activation to zero.

[0100] The expanded output A L of Layer L can be represented as a matrix in which each element comprises a linear combination of ANN parameters, and the output vector a L of Layer L is obtained by summing the elements in each row:

Or equivalently, each element of vector a L is:

A L may be expressed with respect to a particular ANN parameter set p j , where each matrix element denotes a scalar multiplier (which comprises a linear combination of ANN parameters) that multiplies parameter p j , and the parameters are shown as diagonal elements of the diagonal matrix shown above. Thus, some aspects can compute the values in A L with respect to a parameter set p j , and compute an update to A L based on an update to one or more parameters p j . In one aspect, one or more of the parameters p j is an additive update Δp j (for example, Δp 1 ), and an additive update ΔA L or Δa L (or an update to the error, loss, or cost) might be computed from:

The value of each additive update Δp j can be selected to simplify computations. For example, if the scaling values were previously computed, Δp j can be selected to be a product with some 10 -n (where n is some integer), which can be implemented via a shift in the decimal point of each of the previously computed values. In some aspects, Δp j can take the form 2 -n , which can be computed via bit shifts on each of the previously computed values. Parameters, such as weights and/or piecewise linear approximations of the activation functions, can be quantized to provide for 10 -n or 2 -n steps, for example.
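
As a hedged illustration of the power-of-two update described above (illustrative names; numpy assumed), constraining Δp j to 2 -n lets the previously computed multipliers be updated by an exponent shift rather than a general multiplication:

import numpy as np

def additive_update_pow2(scaling_col, n):
    """Additive update contribution for Delta p_j = 2**(-n): the previously computed
    multipliers of p_j are scaled by 2**(-n), which corresponds to a right shift by n
    bits in fixed-point arithmetic (np.ldexp performs the exponent shift in floating point)."""
    return np.ldexp(scaling_col, -n)

scaling_col = np.array([1.5, -0.75, 3.0])                 # previously computed multipliers of p_j
delta_a_contrib = additive_update_pow2(scaling_col, n=4)  # Delta p_j = 2**-4
# delta_a_contrib is added into Delta a^L (or the corresponding column of Delta A^L).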

[0101] In some aspects, one or more of the parameters p i is updated via a scaling factor α i (for example, a scaling factor α 1 that multiplies parameter p 1 ). The parameters, such as weights and/or piecewise linear approximations of the activation functions, can be quantized to provide for 10 -n or 2 -n steps in the scaling factor α i , for example, such as to provide the advantages described above. The updated A L can be expressed as follows, or with respect to a previously computed A L :

[0102] In some aspects, the gradients can be computed, such as

In one example, the update to a L due to an additive parameter update can be computed accordingly. Alternatively, multiplicative updates can be employed.

[0103] In other aspects, the above descriptions can be adapted to update the ANN model with respect to parameters at layer L-2, L-3, etc. These updates can be made with respect to changes to any of the Layer L-2 parameters. Similarly, the model updates can be made with respect to changes to any of the Layer L-3 parameters. As in the case of Layer L-2, A L can be expressed with respect to the Layer L-3 parameters, such as:

Disclosed aspects can employ any of these expressions in a similar manner with respect to their corresponding parameters. The derivatives can be employed for selecting corresponding parameter updates.

[0104] In one aspect, a change in the slope of a non-linear activation function in an ANN can be implemented as an update to the linear-approximation model of the ANN. For example, at node 0 in Layer (L-1), an update to the activation function, such as may be due to a change in its argument, can be modeled as an update to the activation function's scaling factor or to the node's output. In accordance with some aspects of the disclosure, accounting for the nonlinear behavior of the activation can be implemented as an update to the linear model, such as expressed by a matrix A L , and the update can comprise a scaling of, or an additive update to, a column in A L . Alternatively, the update can be implemented as an additive update to the output vector a L . Because the activation functions are approximated as linear or piecewise linear functions, and a linear model is used to compute the gradients, this can enable nonlinear activation functions that are not easily differentiable to be used in ANNs.

[0105] If the node's activation is a ReLU, the activation function can transition from linear (e.g., slope = 1) to zero (e.g., deactivated). Thus, the corresponding activation scaling factor can be updated to zero. This update has the effect of removing column 0 from A L , and can be implemented in various ways, including explicitly removing column 0 from A L , skipping multiplications and/or additions of values from column 0, or performing an additive update to a L or the errors, for example. Similarly, a node can be removed, or turned off, such as by setting its activation scaling factor or output to zero, or by zeroing the corresponding weights in the next layer.

[0106] Considering Layer (L-1), vector z L-1 can be expanded to Z L-1 = W L-1 A L-2 , wherein weight matrix W L-1 operates on the expanded matrix A L-2 , whose diagonal elements are the Layer (L-2) output values. The vector z L-1 can be obtained by summing the elements in each row of Z L-1 .

[0107] In one aspect, in Layer L-1, a change to the argument of an i th node's activation function due to a change in the j th Layer L-2 output value is computed as the change in the value in the j th column of the i th row of Z L-1 ; the change in the output value is scaled by the corresponding weight. In another aspect, in Layer L-1, a change to the argument of an i th node's activation function due to a change in the weight w i,j is computed as the change in the value in the j th column of the i th row of Z L-1 ; the change in the weight is scaled by the corresponding Layer L-2 output value. The change, or the updated argument, can be compared to one or more threshold values to decide if the node's activation scaling factor should be updated. In some aspects, the argument is updated based on a change in the bias.

[0108] The expansion of terms can be employed, and each term in A L will be expanded to up to N terms with respect to layer L-1, and the resulting expression for A L can be denoted as A L (L-1). Similar increases in the number of terms in A L occur with each subsequent layer expansion (e.g., layer L-2, layer L-3) in A L . However, the number of additional non-zero terms due to an expansion can be significantly reduced by exploiting sparsity in the nodes and/or weights.

[0109] In some aspects, only a subset of the columns in A L may be employed for computing gradients and/or updates, and that subset is further reduced with sparsity. In one example, the gradient with respect to an activation function can be computed by expressing A L as a function of that activation in A L (L-1). In one example, a change in a bias may cause a change in an activation, which can be computed, such as via a recursion function that computes the activation function or the slope of the activation. From this, a linear approximation of the activation may be computed, and the corresponding change in A L can then be computed. Also, the derivative of A L with respect to any of the weights can be computed. In some aspects, a derivative computed from A L can comprise a single column. The derivatives of a L with respect to any of the parameters can be derived by summing the rows of the corresponding derivatives of A L . Furthermore, the derivatives of the errors and/or the derivatives of the total cost with respect to each parameter can be derived from the corresponding derivatives of a L .

[0110] FIG. 1C is a flow diagram that illustrates some method and apparatus aspects configured in accordance with aspects of the disclosure. Data preprocessing 111 can be performed before weight initialization 112. A forward pass 113 through the network may include batch norm processing 114. A total cost may be computed 115, and ANN parameters can be updated 116. A loop may be performed to repeat a process from 113 to 116 until one or more criteria are met.

[0111] The disclosed ANNs can include any type of sensor(s) for providing input data to the ANNs. Data may be preprocessed 111 and input for training, testing, or online operation. Data normalization may be employed, such as to normalize training and test sets. Data augmentation may be employed to increase the size of the training set. The data can be formatted for batch or mini-batch processing. A subset of training examples can be used in each mini-batch. Each mini-batch might be processed on a different processor. Various pipelined and parallel-computing architectures may be employed. Each mini-batch produces a corresponding ANN output Y.

[0112] If the mini-batch size is m (the full training set), then batch gradient descent may be performed, wherein all the examples are processed before each parameter update. If the mini-batch size is 1, this is stochastic gradient descent, wherein every training example is its own mini-batch. Mini-batch size can depend on the size of the training set and the size of memory. Mini-batch size may be set to a power of two for processing efficiency.

[0113] In one aspect, forward propagation 113 to cost computation 115 is performed for each mini-batch. Parameter update 116 can include computing gradients of the cost, and updating the parameters based on the gradients and the learning rate. One pass through the training set is referred to as one epoch. The learning rate may decay as a function of the number of epochs. This decay rate is a hyperparameter that can be tuned.

[0114] Disclosed aspects can employ supervised learning, such as with structured data or unstructured data. During a learning phase, input data x may comprise labeled data. For example, input data x may be paired with actual (e.g., labeled, true) output y. An input feature vector x of length N is constructed for each training set. Supervised learning can employ offline data and/or online data. For offline data, a number m of training sets {(x (1) ,y (1) ),... ,( x (m) ,y (m) )} is precomputed. For online data, the actual output y can be computed for each input x. In aspects wherein a linear model computes y from x, partial update methods and/or expanded matrix methods can be employed to simplify computational processing for calculating y. Because training is iterative, algorithms that speed up computations are important. Thus, disclosed aspects can be configured to approximate nonlinear systems as linear systems, and then exploit simplifying linear computational processing to speed up computations.

[0115] In weight initialization 112, the ANN's parameters (e.g., synapse weights and biases) are initialized before training. The weights can be initialized randomly, but the bias does not need to be initialized randomly. A weight matrix w corresponding to a particular layer l has dimension i×j, where i is the number of nodes in the layer l, and j is the number of inputs to each node. If some nodes have fewer inputs than others, the weight matrix w can be a sparse matrix. A corresponding bias vector has dimension i×1. Initialization can comprise setting the variance of the random initialization weights based on the type of activation function employed at each node.

[0116] The forward pass 113 produces a prediction set that is compared to the actual, or true, label y (i) (indicated here for an i th training set) to produce an error, or loss. At each node in each hidden layer l, the output from the previous layer is weighted and summed, and a bias may be added: where [l] denotes layer l, (1) denotes the first training example (and subsequent training examples might be denoted by 2,...,m), and the subscript denotes the node in the layer l. Then an activation function f(·) at each node can be performed on the corresponding sum:

Disclosed aspects can employ scaling terms to provide linear approximations of the activation function f(·), which can be configured to operate separately on each component of the sum.

[0117] Batch norm 114 may be employed. For example, in a hidden layer, z [l] can be computed from the inputs (activations) a [l-1] and the weights w [l] , and then the batch norm parameters β [l] and γ [l] can be employed to compute the batch norm z [l] BN (the mean- and variance-normalized z [l] ) from z [l] , followed by performing the activation function. Thus, the parameters for each layer can be w [l] , β [l] , and γ [l] , where this β is different from the hyperparameter β used for momentum. In one aspect, batch norm can be implemented in the following steps, shown in FIG. 4C:

1. Compute the mean: μ = (1/m) Σ i z [l](i) .

2. Compute the variance: σ 2 = (1/m) Σ i (z [l](i) - μ) 2 .

3. Compute the norm: z [l](i) norm = (z [l](i) - μ)/√(σ 2 + ε), where ε is a small constant that accounts for a variance at or near zero.

4. Compute z [l](i) BN = γ [l] z [l](i) norm + β [l] , where γ [l] and β [l] are learnable parameters. The variance might be adaptable, and can differ for different activation functions, or for other purposes.

5. Compute Δz [l](i) BN (based on the above steps, for one or more parameter updates at a previous layer) 421.

6. Determine if z [l](i) BN is within the linear range of the activation function, or within a quantization zone (e.g., between breakpoints) 422.

7. A) If yes: a previous activation scaling factor g l (u = previous) is selected for use as g l .

B) If no: a new activation scaling factor g l (u = new) is computed or selected from a PWL approximation of the activation function 423.

8. Compute an update to an expanded matrix A L using g l (u).

10. Update parameter(s) 425, which can be based on the total cost computed from A L .
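
A minimal Python sketch of the batch-norm steps above follows (illustrative names; numpy assumed); the breakpoint test in select_scaling_factor corresponds to steps 6-7, deciding whether the previous activation scaling factor can be reused:

import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Steps 1-4: normalize a mini-batch of pre-activations z (shape: batch x nodes)
    and apply the learnable scale gamma and shift beta."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta

def select_scaling_factor(z_bn, z_bn_prev, breakpoints, slopes, g_prev):
    """Steps 5-7: if the updated z_BN stays in the same quantization zone (between the
    same breakpoints) as before, reuse the previous scaling factor; otherwise take the
    slope of the new piecewise-linear segment. slopes has len(breakpoints)+1 entries."""
    seg_new = np.searchsorted(breakpoints, z_bn)
    seg_prev = np.searchsorted(breakpoints, z_bn_prev)
    g_new = np.asarray(slopes)[seg_new]
    return np.where(seg_new == seg_prev, g_prev, g_new)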

[0118] In some aspects, the scaling term (e.g., γ [l] ) can be implemented as a scaling update to the linear model (as either a multiplicative or additive update, for example), and the shift term (e.g., β [l] ) can be implemented in an additive update to the linear model.

[0119] At test time, μ and σ 2 can be computed using exponentially weighted averages across mini-batches. For example, μ {1} can be computed for the first mini-batch, ..., and μ {m} computed for the m th mini-batch, and the running average can be computed. This process can also be performed for σ 2 , and the exponentially weighted averages of μ and σ 2 can be used to compute z [l] BN at test time. Similarly, some aspects can compute exponentially weighted averages across mini-batches for the activation scaling factors g l . In some aspects, updates to the normalization (such as computed from the mean and variance) can be incorporated into updates to β [l] and/or γ [l] . The updates to β [l] can be implemented as corresponding updates to the activation scaling factor g [l] employed in a linear update model. Some aspects can normalize z [l] (e.g., before the activation function), and some aspects can normalize a [l] after activation.

[0120] At a given layer l, an exemplary aspect computes the change (δβ [l] ) in the loss or cost function with respect to β [l] , and/or computes the change (δγ [l] ) in the loss or cost function with respect to γ [l] , and then computes an updated β [l] : β [l] = β [l] - αδβ [l] , and/or an updated γ [l] : γ [l] = γ [l] - αδγ [l] , where α is the learning-rate hyperparameter, using gradient descent, Adam, RMSprop, momentum, or some other technique, to update the parameters β [l] and γ [l] , which both have dimension (n [l] , 1).

[0121] It should be appreciated that some implementations can be employed via the following steps, as shown in FIG. 4D:

1. For t = 1,..., number of mini-batches: compute the forward propagation based on input X 431.

2. In each hidden layer, batch-norm might be employed 432 to replace z [l] with z [l] BN (e.g., normalized to mean zero and variance one).

3. Then employ methods disclosed herein to compute gradients of the parameters 433 (or derivatives of the output, error, cost, loss, etc. with respect to the parameters), e.g., δw [l] , δβ [l] , δγ [l] . This can comprise configuring an expanded matrix with elements (e.g., columns) that are multipliers of the parameters, e.g., w [l] , β [l] , γ [l] . These elements can be numerically computed, and used to compute the gradients 433.

4. Updates to the parameters can be computed 434, such as via gradient descent, RMSprop, momentum, Adam, or some other parameter update algorithm.

[0122] The following process can be implemented for each of the later layers, for a given parameter update at an earlier layer. For each mini-batch (e.g., mini-batch {1}), compute z [l] or Δz [l] from the inputs a [l-1] and weights w [l] (z [l] = w [l] a [l-1] ), compute the mean and variance for the mini-batch of the z [l] values, perform batch-norm on the z [l] values (e.g., compute the normalized z [l] , and compute z [l] BN = γ [l] z [l] + β [l] ), and then determine if the corresponding activation scaling function g [l] needs to be updated (such as by determining if the z [l] BN value is still within the same quantization region as the previous z value). Note that the bias b [l] can be omitted, since batch-norm subtracts the mean of the z [l] values, so any constant added to z [l] is cancelled out. Update g [l] where necessary. Include the effects of the corresponding updated a [l] in the later layer(s). This process can be repeated for other mini-batches. In the case of covariate shift, there might be an initial X-to-Y mapping, but the distribution of X may change at a later time, which can necessitate retraining the learning algorithm.

[0123] A loss function, such as , can be computed for each training example in each of the m training sets. A cost function (or total cost) J(w, b) applies to the entire training, and can be computed 115 as an average of the loss functions:

A regularization term may be added to the cost function (e.g., L1 or L2 regularization), and is usually a function of the weights. The regularization affects the parameter updates (e.g., updates to w), so L2 regularization effects weight decay in the updates.

[0124] The parameters (e.g., w and b) can be updated 116 (i.e., modified) to minimize the total cost J(w, b). For example, the updates can be expressed by:

where α denotes a learning rate. The partial derivatives can comprise an average of computed partial derivatives. Weight updates can include regularization.
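
By way of illustration, a minimal sketch of the update rule above, including the weight decay effected by L2 regularization (illustrative names; not the only form such an update may take):

def sgd_step(w, b, dw, db, lr, l2=0.0):
    """Gradient-descent parameter update; the optional L2 term effects weight decay."""
    w = w - lr * (dw + l2 * w)
    b = b - lr * db
    return w, b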

[0125] Any optimization algorithm can be employed. Gradient descent can be employed with momentum, wherein exponentially weighted moving averages of the gradients may be employed. Another aspect may employ RMSprop:

Another aspect may employ the Adam (adaptive moment estimation) optimization algorithm. Another aspect may employ learning-rate decay, wherein the learning steps get smaller as convergence is approached. Various learning-rate decay schemes may be used to change α. Since smaller steps facilitate the use of linear approximations, the model may switch to linear approximations when a threshold step size is crossed.
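
The momentum and RMSprop variants mentioned above can be sketched as follows (illustrative names and default hyperparameters; numpy assumed):

import numpy as np

def momentum_step(w, dw, v, lr, beta=0.9):
    """Gradient descent with momentum: v is the exponentially weighted average of dw."""
    v = beta * v + (1.0 - beta) * dw
    return w - lr * v, v

def rmsprop_step(w, dw, s, lr, beta=0.999, eps=1e-8):
    """RMSprop: s is the exponentially weighted average of the squared gradients."""
    s = beta * s + (1.0 - beta) * dw ** 2
    return w - lr * dw / (np.sqrt(s) + eps), s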

[0126] Dropout can be performed, such as by setting the activations of selected hidden units to zero. This may be employed as a regularization scheme. For example, weights or activation functions can be set to zero. This can involve zeroing elements in the expanded matrices or changing the dimension of the matrices.

[0127] In one aspect, a node is selected for dropout if its activation is in the linear range of the activation function. For example, if z is small, g(z) = tanh(z) is in its linear range. In this case, the node is said to be linear. The history of the node's activation function (e.g., relative to iterations, training examples, etc.) can be used to determine a likelihood that the activation function will remain linear (e.g., within a threshold of linearity, such as may be determined based on the difference between the linear approximation and the activation function) for future iterations and/or training examples. Then the linear node may be removed (dropped out), or the linear node's bias or activation function may be updated to adjust the node's activation into a nonlinear range.

[0128] If all the nodes are linear (e.g., for small changes in the argument of the activation function, and for activation functions operating in their linear range), the entire network can be approximated with a linear model. In some aspects, an initial training stage can comprise employing a linear approximation of an ANN to achieve an initial set of parameters and/or hyperparameters, which provide a solution that may have more bias than desired. Then the initial set of parameters can be employed in a nonlinear implementation of the ANN.

[0129] FIG. 6 illustrates a method whereby network quantization and weight sharing can compress a network by reducing the number of bits required to represent each weight. In one aspect, the weights in each layer's N×N weight matrix 610 can be quantized to M bins. Thus, a table 611 of M 32-bit values representing shared weights, or centroids, can take the place of N 2 32-bit values of individual weights. N 2 small indices (or pointers) 612 that point to the table 611 of M values are included. Centroid initialization can employ various initialization methods, such as linear initialization, which linearly spaces the centroids between the [min, max] of the original weights. This helps to maintain the large weights, as the large weights play a more important role than the smaller ones. Since all the weights in the same bin share the same value, each weight in the N×N weight matrix 610 is replaced with a small index 612 into the table 611 of shared weights. During an update, the gradients are computed, and the gradients corresponding to each weight can be grouped according to the weight's index. All gradients corresponding to the same index can be summed to produce M sums, and each sum can be scaled by a learning rate, and may be quantized to a restricted set. The scaled and quantized gradients might be subtracted from their respective centroids to provide for fine-tuning of the centroid values in table 611. Aspects can use k-means clustering to identify the shared weights for each layer so that all the weights that fall into the same cluster share the same weight. In some aspects, weights are not shared across layers. Huffman coding may be employed, wherein a code table can be derived from the occurrence probability of each symbol, and more common symbols are represented with fewer bits.
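
A hedged Python sketch of the weight-sharing scheme of FIG. 6 follows (numpy assumed; linear centroid initialization as described above, with gradients grouped by index for centroid fine-tuning; names are illustrative):

import numpy as np

def share_weights(W, M):
    """Linearly initialized centroids between [min, max]; each weight is replaced by the
    index of its nearest centroid (cf. table 611 and index matrix 612 in FIG. 6)."""
    centroids = np.linspace(W.min(), W.max(), M)             # table of M shared weights
    idx = np.abs(W[..., None] - centroids).argmin(axis=-1)   # small index per weight
    return centroids, idx

def fine_tune_centroids(centroids, idx, grad_W, lr):
    """Group gradients by shared-weight index, sum each group, and step the centroids.
    (Scaled sums could additionally be quantized to a restricted set, as described above.)"""
    grad_c = np.zeros_like(centroids)
    np.add.at(grad_c, idx.ravel(), grad_W.ravel())
    return centroids - lr * grad_c

# Reconstruction of the shared-weight matrix: W_hat = centroids[idx]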

[0130] As described above, limiting the number of effective weights that need to be stored can be achieved by having multiple connections share the same weight, and then fine-tuning the shared weights in subsequent updates. Note that PWL approximation of the activation functions can be performed to achieve both quantization and value sharing. Activation scaling factors 620 can be quantized, with the indexed values stored in table 621, and the activation scaling factors in 620 can be replaced with indices to the corresponding values in table 621.

[0131] Formula 630 represents a linear update operation. The weights and activations in the formula 630 can be replaced with indices to corresponding values in tables 611 and 621 to provide the algorithm in 631. The algorithm 631 can be configured to skip multiplications and/or additions that involve small values. When a value is updated, the formula may be updated via appropriate substitution. A numerical computation of the formula 631 can be stored in table 641. In some aspects, products and/or sums in the algorithm 631 may be stored in the table 641. The stored products and/or sums may be operated upon by an update mechanism to update their values according to updates to any of their terms. In some aspects, stored products and/or sums may be common to multiple formulas, so updates might be performed only once to provide updates to multiple formulas.

[0132] Fine tuning the shared weights can be performed via updates to expanded matrices. In some aspects, each formula (e.g., 631) might comprise one or more expanded matrices with respect to the values being updated. In some aspects, the updates can be constrained to a restricted set that simplifies update operations on numerically computed values in the expanded matrices. Some aspects can provide for employing the scaled and quantized gradients to operate on numerically computed values in an expanded matrix in order to compute an update expanded matrix, which can be summed with a previous expanded matrix to produce an updated expanded matrix.

[0133] As a result of weights and/or activations being zero, expanded matrices disclosed herein can be sparse. The sparse structure can be stored using compressed sparse row (CSR) or compressed sparse column (CSC) format, which requires 2a + n + 1 numbers, where a is the number of non-zero elements and n is the number of rows or columns. To compress further, the index differences may be stored instead of the absolute positions. For each column Aj of matrix A, we store a vector v that contains the non-zero values, and a second, equal-length vector q that encodes the number of zeros before the corresponding entry in v. The v and q of all columns are stored in one large pair of arrays with a pointer vector p' pointing to the beginning of the vector for each column. A final entry in p' points one beyond the last vector element. Storing the sparse matrix by columns in CSC format makes it easy to exploit sparsity. An update, for example, can simply multiply each non-zero value by all of the non-zero elements in its corresponding column. Some aspects can provide for distributing the matrix and parallelizing the matrix-vector computation by interleaving the rows of the matrix A over multiple processing elements (PEs). The interleaved CSC representation allows each PE to quickly find the non-zero values in each column to be multiplied.
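
A minimal Python sketch of the column-wise storage described above follows (illustrative names; numpy assumed), storing per-column non-zero values v, zero counts q, and column pointers p:

import numpy as np

def to_relative_csc(A):
    """Store a sparse matrix column-by-column as (v, q, p): v holds the non-zero values,
    q holds the number of zeros before each value within its column, and p points to the
    start of each column in v/q, with a final entry one past the last element."""
    v, q, p = [], [], [0]
    for col in A.T:
        zeros = 0
        for x in col:
            if x == 0:
                zeros += 1
            else:
                v.append(x)
                q.append(zeros)
                zeros = 0
        p.append(len(v))
    return np.array(v), np.array(q), np.array(p)

# Usage: v, q, p = to_relative_csc(np.array([[0., 2., 0.], [1., 0., 0.], [0., 3., 4.]]))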

[0134] FIG. 7 is a flow diagram that illustrates a process according to some aspects of the disclosure. This process can reduce memory requirements, which might enable the model to be fit into on-chip SRAM cache rather than off-chip DRAM memory. This can facilitate the use of complex neural networks in mobile applications where application size and download bandwidth are constrained.

[0135] An initial training phase learns 701 the network connectivity and produces a first set of weights. Pruning 702 eliminates connections by setting weights that are below a small threshold value to zero. The remaining weights are quantized 703, followed by fine-tuning the quantized weights 704. Huffman coding may be applied 705. Aspects can employ an expanded-matrix ANN model.

[0136] In one aspect, the initial training phase 701 might update an expanded-matrix ANN model, wherein the elements in each expanded matrix comprise numerically computed values, and the updates operate on the values. The updates may be quantized to a restricted set in order to simplify the operations on the values. In another aspect, pruning 702 may be performed in an expanded-matrix ANN model, such as to skip multiplications and/or additions that involve values which pruning sets to zero. Pruning 702 may remove or otherwise update values in an expanded matrix. Quantization 703 can involve updates to an expanded matrix. In some aspects, numerical computation of values in each expanded matrix is performed following quantization 703, and then fine-tuning 704 may include updating each expanded matrix. The updates may be quantized to a restricted set in order to simplify the operations on the values.

[0137] Disclosed linear models can be used to compute gradients, including the gradient of the prediction output, error, or total loss with respect to a change in any of the ANN parameters. These models can be used to compute updates to any of the parameters. These models can be used to update the change to the prediction output, error, or total loss as a result of parameter updates. These models can be used to determine the effect of activating or deactivating particular nodes or synapses, such as for pruning or dropout. These models can update the prediction output, error, or total loss resulting from changing the ANN topology, such as by activating or deactivating particular nodes or synapses, skipping layers, and the like. These models can update the expanded matrix A L , the prediction output, error, or total loss based on updates to the data input x. The effects of subsequent data inputs to the ANN can be modeled as updates to expanded matrices developed from previous data inputs. In some aspects, data inputs corresponding to different training examples can be grouped such that subsequent examples can employ the expanded matrices used in previous examples.

[0138] Once an updated expanded output matrix is computed, the elements in each row can be summed to generate the output vector. During training, the output vector can be compared to a vector of target values, and a cost function computed therefrom. A common objective is to compute the change in the cost function with respect to weights and biases throughout the ANN, which is typically performed by computing partial derivatives in a backpropagation and applying the chain rule. The cost function is computed from the activations of the final layer. The error in a given neuron is related to the rate of change in the cost function with respect to a change in the neuron's weight, and the rate of change in the cost function with respect to a change in the neuron's bias. One objective seeks to identify changes to the neuron's input that make the cost smaller. By discovering which weights and biases affect the cost the most, disclosed aspects can more efficiently train the network. In some aspects, parameter updates can be skipped for those parameters whose effect on the cost is below a threshold value.

[0139] In another approach, by inspecting the coefficients in each column of the expanded matrix, it can be rapidly determined which weights and biases have the greatest effect on the cost function, and/or candidate weights and biases can be discarded, without requiring the full computation of the cost function. A similar approach may be performed to determine which neurons to eliminate from the network such that elimination of the selected node(s) has minimal or no effect on the network's accuracy.

[0140] The total error can be computed from the squared difference of each output value minus its corresponding target value, and these squared differences can be summed over all the outputs to produce the total error. In one example, the change in the total error for an update might be a function of the change in the i th column of the expanded output matrix. Thus, computing the magnitudes in the i th column in each row can provide a quick indication of whether the corresponding weight or bias is a good candidate to update for learning. The amount of change corresponding to each row may be divided by the sum of the row's elements, and the row with the highest value(s) may be determined, whereby the corresponding weight or bias might be selected for updating. A change in the magnitude may be used to determine subsequent updates for the corresponding parameter(s).

[0141] FIG. 8A depicts a GPU parallel computing architecture that includes N_SM levels of streaming multiprocessors (SMs) 910.1-910.N (SM 1, SM 2, ..., SM N_SM), each comprising a shared memory component 912, a level of M registers 914.1-914.M, a level of streaming processors (SPs) 916.1-916.M (SP 1, SP 2, ..., SP M), an instruction unit 918, a constant cache component 920, and a texture cache component 922. There are various memories available in GPUs, which can be organized in a hybrid cache and local-store hierarchy. The memories can include off-chip global memory, off-chip local memory, on-chip shared memory, off-chip constant memory with on-chip cache, off-chip texture memory with on-chip cache, and on-chip registers. An off-chip device memory component 924 can include global memory and/or constant and texture memory. The GPU architecture can include or be communicatively coupled 901 to a CPU 904 and a CPU memory 906, which may be adapted to store computer-readable instructions and data for performing the activity of the CPU 904. The CPU 904 may be in operative communication with components of the GPU architecture or similar components via a bus, a network, or some other communication coupling. The CPU 904 may effect initiation and scheduling of the processes or functions performed by the GPU architecture.

[0142] The shared memory 912 is present in each SM and can be organized into banks. A bank conflict can occur when multiple addresses belonging to the same bank are accessed at the same time. Each SM 910.1-910.N also has a set of registers 914.1-914.M. The constant and texture memories are read-only regions in the global memory space, and they have on-chip read-only caches. Accessing the constant cache 920 is faster, but it has only a single port, and hence it is beneficial when multiple processor cores load the same value from the cache. The texture cache 922 has higher latency than the constant cache 920, but it does not suffer greatly when memory read accesses are irregular, and it is also beneficial for accessing data with two-dimensional (2D) spatial locality.

[0143] FIG. 8B is a flow diagram that is illustrative of a method, functional components of an apparatus, and code segments of a computer program in accordance with aspects of the disclosure. Input data symbols and ANN parameters are processed for generating 951 a base expanded matrix at Layer L having a plurality of rows and a plurality of columns, wherein a sum of values in each row can produce a base signal vector, such as an ANN output vector (or Layer L activation). Values in at least one column of the base expanded matrix can be updated 952 to produce an updated expanded matrix. The values in each row of the updated expanded matrix can be summed 953 to produce an updated ANN output vector.
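By way of non-limiting illustration, the following Python sketch instantiates steps 951-953 for a small two-layer model operating on a linear segment: a base expanded matrix is generated whose row sums equal the layer output (951), an update to one hidden-layer weight is applied as a column update (952), and the rows are re-summed to produce the updated output vector (953). The layer sizes, variable names, and purely linear activations are assumptions made for illustration.

    import numpy as np

    # Hypothetical two-layer model: output = W2 @ (W1 @ x + b1) + b2 on a
    # linear segment.  Each column of the base expanded matrix is grouped with
    # one hidden node (plus a bias column), so row sums give the layer output.
    rng = np.random.default_rng(0)
    n_in, n_hidden, n_out = 4, 3, 2
    W1, b1 = rng.normal(size=(n_hidden, n_in)), rng.normal(size=n_hidden)
    W2, b2 = rng.normal(size=(n_out, n_hidden)), rng.normal(size=n_out)
    x = rng.normal(size=n_in)

    def expanded_matrix(W1, b1, W2, b2, x):
        """951: column j holds W2[:, j] * (hidden value j); last column holds b2."""
        h = W1 @ x + b1                       # hidden values on a linear segment
        A = W2 * h                            # A[:, j] = W2[:, j] * h[j]
        return np.hstack([A, b2[:, None]])    # append output-bias column

    A = expanded_matrix(W1, b1, W2, b2, x)
    base_output = A.sum(axis=1)               # 953: row sums = base output vector
    assert np.allclose(base_output, W2 @ (W1 @ x + b1) + b2)

    # 952: update one hidden-layer weight, W1[j, i]; only column j of A changes.
    j, i, delta = 1, 2, 0.25
    A_updated = A.copy()
    A_updated[:, j] += W2[:, j] * (delta * x[i])   # linear column update
    updated_output = A_updated.sum(axis=1)         # 953: re-sum rows

    W1_new = W1.copy()
    W1_new[j, i] += delta
    assert np.allclose(updated_output, W2 @ (W1_new @ x + b1) + b2)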

[0144] At least one feature of the updated expanded matrix and/or the updated ANN output vector may be measured 954. If only the updated expanded matrix is measured 954, then the diagram may flow directly from update 952 to measure 954. If an updated expanded matrix meets at least one measurement criterion in 954, the rows of the expanded matrix may be summed 953. In an aspect, the measurement in 954 is used, at least in part, to control the update operation 952. In an aspect, the measurement in 954 is used, at least in part, to assign at least one updated expanded matrix as a base expanded matrix in 951, which may be subsequently updated 952, such as in an iterative or recursive process.
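A non-limiting Python sketch of this control flow follows, in which a measurement (954) on each candidate updated expanded matrix determines whether its rows are summed (953) and whether it is promoted to the new base expanded matrix (951); the update generator, the measurement, and the acceptance criterion are placeholders chosen for illustration.

    import numpy as np

    rng = np.random.default_rng(3)
    A_base = rng.normal(size=(2, 5))           # 951: hypothetical base expanded matrix

    def propose_update(A, rng):
        """952: placeholder update - perturb one column as a candidate parameter update."""
        A_upd = A.copy()
        col = rng.integers(A.shape[1])
        A_upd[:, col] += 0.1 * rng.normal(size=A.shape[0])
        return A_upd

    def measure(A_upd, A_base):
        """954: placeholder measurement, e.g. magnitude of the change in the matrix."""
        return np.abs(A_upd - A_base).sum()

    for _ in range(10):                        # iterative/recursive refinement
        A_upd = propose_update(A_base, rng)    # 952
        if measure(A_upd, A_base) > 0.05:      # 954: criterion controls the flow
            output = A_upd.sum(axis=1)         # 953: row sums -> updated output vector
            A_base = A_upd                     # assign as new base expanded matrix (951)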

[0145] The detailed description set forth above in connection with the appended drawings describes examples and does not represent the only examples that may be implemented or that are within the scope of the claims. The term "example," when used in this description, means "serving as an example, instance, or illustration," and not "preferred" or "advantageous over other examples." The detailed description includes specific details for the purpose of providing an understanding of the described techniques. These techniques, however, may be practiced without these specific details. In some instances, well-known structures and apparatuses are shown in block diagram form in order to avoid obscuring the concepts of the described examples.

[0146] Information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, computer-executable code or instructions stored on a computer-readable medium, or any combination thereof.

[0147] The various illustrative blocks and components described in connection with the disclosure herein may be implemented or performed with a specially-programmed device, such as but not limited to a processor, a digital signal processor (DSP), an ASIC, an FPGA, a CPU, a GPU, or other programmable logic device, discrete gate or transistor logic, a discrete hardware component, or any combination thereof designed to perform the functions described herein. A specially-programmed processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A specially-programmed processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

[0148] The functions described herein may be implemented in hardware, software executed by a processor, firmware, or any combination thereof. If implemented in software executed by a processor, the functions may be stored on or transmitted over as one or more instructions or code on a non-transitory computer-readable medium. Other examples and implementations are within the scope and spirit of the disclosure and appended claims. For example, due to the nature of software, functions described above can be implemented using software executed by a specially programmed processor, hardware, firmware, hardwiring, or combinations of any of these. Features implementing functions may also be distributed such that portions of functions are implemented at different physical locations. Also, as used herein, including in the claims, "or" as used in a list of items prefaced by "at least one of" indicates a disjunctive list such that, for example, a list of "at least one of A, B, or C" means A or B or C or AB or AC or BC or ABC (i.e., A and B and C).

[0149] Computer-readable media includes both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. By way of example, and not limitation, computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code means in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of computer-readable media.

[0150] The previous description is provided to enable a person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the common principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Furthermore, although elements of the described aspects and/or embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated. Additionally, all or a portion of any aspect and/or embodiment may be utilized with all or a portion of any other aspect and/or embodiment, unless stated otherwise. Thus, the disclosure is not to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.