Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
METHOD FOR THE COMPUTATION OF A NARROW BIT WIDTH LINEAR ALGEBRA OPERATION
Document Type and Number:
WIPO Patent Application WO/2023/160868
Kind Code:
A1
Abstract:
The present invention relates to a method for computing a linear algebra operation of two operands or arrays comprising one or more narrow bit width elements with a digital circuit. The method uses the principle of binary segmentation to reduce the computation overhead of linear algebra operations like linear convolution and inner product of operands such as vectors with narrow bit width components. The invention is also directed to a digital circuit configured to execute the method.

Inventors:
REGGIANI ENRICO (ES)
UNSAL OSMAN (ES)
CRISTAL KESTELMAN ADRIÁN (ES)
Application Number:
PCT/EP2022/087935
Publication Date:
August 31, 2023
Filing Date:
December 28, 2022
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
BARCELONA SUPERCOMPUTING CENTER CENTRO NAC DE SUPERCOMPUTACION BSC CNS (ES)
International Classes:
G06F7/544
Foreign References:
EP0602888A11994-06-22
EP0486143A21992-05-20
Other References:
YAO FU ET AL: "Deep Learning with INT8 Optimization on Xilinx Devices", WHITE PAPER: ULTRASCALE AND ULTRASCALE+ FPGAS, 24 April 2017 (2017-04-24), pages 1 - 11, XP055493137, Retrieved from the Internet [retrieved on 20180717]
Attorney, Agent or Firm:
ESCUDERO PRIETO, Nicolás (ES)
Download PDF:
Claims:
CLAIMS

1. Method for the computation of a linear algebra operation of two operands u, v, by means of a digital circuit operatively connected to an arithmetic-logic unit; wherein the method comprises, by the digital circuit, the steps of: receiving (10) two operands u, v, composed of m, n narrow bit width elements, respectively, such that u = [u0, ], the elements having bit widths bu, bv, respectively, extending (20) the bit width of each element of the operands to a clustering bit width cw by adding zeroes to the left of the most significant bit of the element, and concatenating the elements of each operand to obtain the values U, V, of bit widths cw m and cw n, respectively, wherein the clustering width fulfils the condition: cw > max (bu, bv) feeding (30) the values U, V, to the arithmetic-logic unit, receiving (40) a result value W of bit width p from the arithmetic-logic unit, W =

[w0. Wp. , extracting (50) an operation result w from the result value W, wherein the operation result w comprises one or more bits of W.

2. Method according to the previous claim, wherein the value W from the arithmetic-logic unit is obtained by multiplying, or adding the values U and V.

3. Method according to any of the previous claims, wherein the clustering width fulfils the condition: cw > bu + bv + log2(max(m, n)).

4. Method according to the previous claim, wherein after receiving (10) two operands u, v, the method comprises the step of reverting (11) the order of the elements of the operand v = [v0, .... Vn- J to obtain an operand v’ = [vn-1; .... v0 ].

5. Method according to claim 3, wherein each of the elements of operand v equals one:

V0 = ••• = Vn- = 1. 6. Method according to any of previous claims, wherein the operation result w comprises one or more elements, w = [wo,^, ... ], and wherein each element Wj comprises one or more bits of W according to:

7. Method according to any of claims 4-5, wherein the operation result w comprises one or more bits of W according to:

8. Method according to any of claims 1-2, wherein the clustering width fulfils the condition: cw > bu + log2(m).

9. Method according to the previous claim, wherein, n = 1 and v = [v0 ] = [ ]■

10. Method according to any of previous claims 1-2, wherein the clustering width fulfils the condition: cw > max (bu, bv) + 1.

11 . Method according to the previous claim, wherein, after the step of extending (20) the bit width of each element, the method comprises the step of computing (21) the 2’s complement of the elements v0, ... , of operand v.

12. Method according to any of claims 10-11 , wherein the operation result w comprises one or more elements, w = [wo,^, ... ], and wherein each element Wj comprises one or more bits of W according to:

13. Digital circuit configured to execute the method according to any of claims 1-12, wherein the digital circuit comprises: a pre-processing unit, configured to receive (10) and extend (20) operands u, v, and to feed (30) values II, V to an arithmetic-logic unit, and a post-processing unit, configured to receive (40) a result value W from the arithmetic-logic unit and extract (50) an operation result w.

14. Digital circuit according to previous claim, further configured to be operatively connected to an arithmetic-logic unit of a computer, and further configured to feed (30) or receive (40) values II, V, W to or from the arithmetic-logic unit through a register file, a by pass logic and/or a memory hierarchy.

15. Digital circuit according to claim 13, wherein the pre-processing unit and the postprocessing unit are implemented on a configurable logic block, CLB, wherein the CLB comprises registers and look up tables, LUT.

Description:
METHOD FOR THE COMPUTATION OF A NARROW BIT WIDTH LINEAR ALGEBRA OPERATION

OBJECT OF THE INVENTION

The present invention relates to a method for computing a linear algebra operation of two operands or arrays comprising one or more narrow bit width elements with a digital circuit. The method uses the principle of binary segmentation to reduce the computation overhead of linear algebra operations like linear convolution and inner product of operands such as vectors with narrow bit width components. The invention is also directed to a digital circuit configured to execute the method.

BACKGROUND OF THE INVENTION

Contemporary Internet-of-Things (loT), edge and mobile computing applications require high-performance. This demand is fuelling a large research effort in low-power, high- performance embedded processors. Such devices, mainly constrained by power and cost, have to fulfil the performance and memory requirements of a vast collection of important application domains, such as deep learning, robotics, graph processing, and cryptography. Most of these application classes represent data as matrices and vectors, and express their computation through a set of linear algebra kernels. When targeting edge platforms, a popular approach to reduce energy demands and memory footprint requirements is to compact the data layout using a smaller data format while preserving the application accuracy.

On the one hand, expressing and computing data exploiting low-precision floating-point formats is gaining traction in the High-Performance Edge Computing (HPEC) community, as they represent a good trade-off between data size and accuracy. On the other hand, narrow fixed-point and integer data representations (i.e. , byte and sub-byte) offer a better alternative in terms of Performance per Watt ratio, although they feature smaller number representations. One of the dominant applications of edge computing that leverages these compressed data formats is the Quantized Convolutional Neural Network (QCNN) inference, which exploits quantization to represent data and weights with data sizes typically ranging from eight to one bit with tolerable accuracy penalties. Other applications belonging to important application classes for edge computing, such as graph computing and cryptography, widely rely on Boolean matrices and vectors computations to traverse a graph or to encrypt/decrypt a message. These applications would greatly benefit from hardware and software solutions capable of efficiently computing narrow integer linear algebra kernels.

Porting the computation of these applications from cloud to edge and mobile devices would enable significant improvements in terms of security, safety, and energy efficiency. However, despite their low memory and energy demands, their intrinsically high computational intensity makes the computation of these workloads challenging on highly resource- constrained devices.

Therefore, there is a demand for a lightweight and high-performance method and architecture aimed at increasing the efficiency of linear algebra and narrow integer computations on edge processors.

DESCRIPTION OF THE INVENTION

In a first inventive aspect, the invention provides a method for the computation of a linear algebra operation of two operands u, v, by means of a digital circuit operatively connected to an arithmetic-logic unit; wherein the method comprises, by the digital circuit, the steps of: widths b u , b v , respectively, extending the bit width of each element of the operands to a clustering bit width c w by adding zeroes to the left of the most significant bit of the element, and concatenating the elements of each operand to obtain the values U, V, of bit widths c w m and c w n, respectively, wherein the clustering width fulfils the condition: c w > max (b u , b v ) feeding the values U, V, to the arithmetic-logic unit, extracting an operation result w from the result value W, wherein the operation result w comprises one or more bits of W. The proposed invention relies on the principles of a mathematical technique called binary segmentation, which considerably decreases the arithmetic complexity of a linear algebra computation of operands with narrow bit width elements, while reducing the processing overhead and memory footprint of the computation.

Throughout the present document it will be understood that the method is intended to perform a linear algebra operation, preferably a vector linear convolution, an inner product, a vector reduction, a scalar-vector product, or a vector addition or subtraction, on two operands with narrow bit width elements; by operand with narrow bit width elements should be understood an array of elements, wherein each element is a narrow bit width binary integer; by narrow bit width should be understood a bit width which is smaller than the word size of the architecture of the arithmetic-logic unit.

According to the method, the digital circuit performs a series of transformations on the source operands u, v, transforms them into values II, V, and feeds them to an arithmeticlogic unit, which performs an operation on the values II, V, preferably a multiplication or an addition, yielding a result value W. The result value W is further transformed to extract the result of the operation, w; preferably, the extraction comprises outputting a set of bits of the result value W, for example, by means of a masking operation. Each operation requires a specific pre-processing of the operands and a specific extraction process, as will be explained below.

The digital circuit should be understood as any hardware element capable of performing the transformations of the operands and values required by the method; preferably, the digital circuit is configured to be connected to a known computing system. In particular, the digital circuit is preferably configured to receive and output operands or values from any data storage means, which in a preferred example are the registers of a computer’s processing unit. Also preferably, the digital circuit is connectable, directly or through other elements, to the arithmetic-logic unit of a computer’s processing unit. In other embodiments, the digital circuit is configured to be connected to digital circuits capable of storing data and to digital circuits capable of multiplying and/or adding two binary operands.

In a particular embodiment, the value W from the arithmetic-logic unit is obtained by multiplying, or adding the values U and V. The most common linear algebra operations can be reduced to two basic computations, namely, multiplication and addition. Advantageously, binary segmentation allows to reduce the linear algebra operations of arrays of elements, which normally would require several computation cycles for each element of the array, to a multiplication or addition of only two values, thus reducing the number of cycles necessary to complete the computation.

Moreover, the method allows for the computation of three very significant operations on vectors, namely linear convolution, inner product and vector reduction. Generally, the linear convolution of two vectors is computed as follows:

While the inner product is computed as follows:

Finally, the vector reduction returns a scalar value holding the sum between all the elements of u; in a way, the vector reduction is a variant of the inner product of the vector to be reduced with a vector of ones follows:

However, in conventional computations, array operations are computing intensive, and require substantial processing cycles and use of memory for storage of intermediate results. This is avoided thanks to the present method implementation of binary segmentation, which requires the extension of each element of the array, for example, of each component of a vector, to a certain size, the clustering width, c w ; the extension is performed by adding zeroes to the left of the most significant bit, thus the value of the integers does not change, whereas the subsequent concatenation of the c w -normalized elements produces a value, II or V which can be processed by the arithmetic-logic unit without loss of information.

In a particular embodiment, the clustering width fulfils the condition: c w > b u + b v + log 2 (max(m, n)).

The clustering width for the computation of linear convolution, inner product and vector reduction is obtained according to the bit width of the elements of the operands, b u , b v , and the number of elements of each operand, m, n. This process provides a clustering width which is defined to be greater than the actual bit width of the input elements, b u , b v , as it includes extra guard-band bits to avoid overflows in the segmented data due to carry propagation. This allows performing the summation or multiplication of n narrow integers with only one summation or multiplication of two long integers, instead of n summations or multiplication of short integers. It is worth noticing that binary segmentation is not an approximate computing technique, since it guarantees exact computations, as the clustering width dimension already accounts for the number of bits needed to represent the computation output without introducing precision losses.

In the particular embodiment of the inner product computation, after receiving two operands u, v, the method comprises the step of reverting the order of the elements of the operand v = [v 0 , ..., v n-x ] to obtain an operand v' = [v n-1; ..., v 0 ]. This step is also advantageously applied to the computation of a linear convolution for convolutional neural networks (CNN).

The computation of the inner product, or dot product, of two vectors involves the reversion of the order of the elements of one operand, either u or v; the reversed operand v’ is transformed as general case of the method, that is, zeroes are added to the left of the most significant bit, and the inner product is obtained from the multiplication of the resulting values U, V.

In the particular embodiment of the vector reduction, each of the elements of operand v equals one:

V 0 = ••• = Vn- = 1.

As mentioned above, the vector reduction is processed as a special case of the inner product via binary segmentation between u and v, with v being a unit vector composed only by elements equal to one. The result is reducing vector u, composed of n elements, to a scalar value holding the sum between all the elements of u.

In the particular embodiment of linear convolution, the operation result w comprises one or more elements, w = [wo,^, ... ], and wherein each element w, comprises one or more bits of W according to:

In the method of the invention, once the values II, V have been multiplied, the product, W, must be transformed to obtain the operation result. For the linear convolution, the operation result is an array of elements, and each element comprises a set of bits of W; in particular, for linear convolution, each element Wj comprises a set of c w consecutive bits of W, or conversely, if W is evenly divided into a number of sets of c w consecutive bits, Wj comprises one of those sets, for instance, for wo:

W o — [VF Cw -l< ..., Wo]

Which corresponds to the rightmost set of c w bits of W The next element, wi, would comprise the bits W cw to a-cw-i, and so on. These elements are alternatively known as slices.

However, for operations other than linear convolution, the extraction of the result w is slightly different. In the particular embodiment of the computation of inner product and vector reduction,

For the inner product and the vector reduction, the extraction of the result involves extracting a set of c w consecutive bits from the value W starting from the fm-7 Cwth bit of W, from the least significant bit, or in other words, a central slice of W

Also, the method considers a particular embodiment of the computation of the product between a vector and a scalar, in which n = 1 and v = [v 0 ] = [ ]■

For the computation of a vector and a scalar, the operands are construed respectively as a multi-element array and a single element array, which value corresponds to the scalar operand, or in other words, m > 2, n = 1.

The clustering width needs to be adapted accordingly, and that particular embodiment, the clustering width fulfils the condition: c w > b u + log 2 (m).

It has been stated that the method is also suitable for the computation of array additions and subtractions. For the addition or subtraction according to binary segmentation, the clustering width fulfils the condition: c w > max b u , b v ~) + 1. While the addition of two binary operands is straightforward, and can be computed by an adder, the subtraction of two operands implies computing the 2’s complement of the subtracted operand. In the embodiment corresponding to a subtraction operation, the method comprises the step of computing the 2’s complement of the elements v 0 , v n -i of operand v.

The 2’s complement, or two’s complement, operation is an operation known in the field on binary numbers calculated by inverting the bits and adding one to the original binary number, and amounts to computing the negative of the original number; thus, the subtraction of two operands is computed as the addition of a first value and the negative of the second value.

For the addition and subtraction as well as for the scalar-vector product, the operation result

This extraction process is analogue as the process for the linear convolution result extraction, differing only in the number of elements of the result w.

In a second inventive aspect, the invention provides a digital circuit configured to execute the method according to the first inventive aspect, wherein the digital circuit comprises: a pre-processing unit, configured to receive and extend operands u, v, and to feed values U, V to an arithmetic-logic unit, and a post-processing unit, configured to receive a result value W from the arithmeticlogic unit and extract an operation result w.

As discussed above, the method is intended to be executed by any digital circuit capable of performing the described transformations of the operands; such a digital circuit may be implemented by means of an assembly of interconnected conventional logic gates, or may be implemented by means of a purposefully built printed circuit board or integrated circuit. In all of the cases, the digital circuit may optionally be provided with an arithmetic logic and/or input and output connections and/or data storage means to feed, store and multiply or add the values, so as to create an integral binary segmentation computing unit configured to receive operands and output an operation result. The digital circuit comprises two process-oriented units, a pre-processing unit and a postprocessing unit, each one configured to carry out transformations before and after the addition or multiplication, respectively. Depending on the specific architecture of the digital circuit, some elements of the circuit may be shared between the pre-processing unit and the post-processing unit, or may be used in combination with other elements to perform one or more of the steps of the method.

In an embodiment of the digital circuit, the pre-processing unit comprises an extend unit, configured to extend the bit width of the elements of the operands to the clustering width, and a pack unit, configured to compress its input data into their actual data sizes by converting the elements of the operands into the target output bit width, and forwarding them to a mask unit to merge them. This mask unit is an element of the post-processing unit, and it is also configured to extract the relevant operation result w from the result value W. The digital circuit may further comprise a control unit with a control register, configured to determine the clustering widths and other key parameters according to the input operand bit widths and operation type.

In a particular embodiment, the circuit is further configured to be operatively connected to an arithmetic-logic unit of a computer, and further configured to feed or receive values U, V, W to or from the arithmetic-logic unit through a register file, a by pass logic and/or a memory hierarchy.

In this embodiment, the digital circuit is configured as an add-on, or connectable device, configured to be connected to a pre-existing processing unit or comparable computing element, so that the digital circuit may be advantageously retrofitted on any computer. In this embodiment, upon connection with the computer, the digital circuit loads the values II, V, resulting from the transformation of the input operands u, v, into the registers of the arithmetic-logic unit, possibly together with an instruction relative to the operation, and receives the result of the computation, W. The digital circuit may feed and receive these values through direct buses to the arithmetic-logic unit, or may feed and receive them through other elements, such as register files, by pass logic, or any type of data storage means, including any level of the memory hierarchy. In some processor architectures, for example, CPU’s and GPU’s, the digital circuit behaviour is managed by instructions specific to those architectures. In a particular embodiment, the pre-processing unit and the post-processing unit are implemented on a configurable logic block, CLB, wherein the CLB comprises registers and look up tables, LUT.

A particularly advantageous application of the digital circuit are data flow computer architectures. Examples of data flow architectures on which the digital circuit may be applied are architectures implemented by field programable gate arrays (FPGA) comprising a plurality of programable logic units, or configurable logic blocks (CLB), interconnected through an interconnexion matrix. In those embodiments, the digital circuit may be built with one or more CLB and digital signal processing units DSP, and the pre-processing and post processing units are implemented by means of look up tables (LUT).

DESCRIPTION OF THE DRAWINGS

The foregoing and other advantages and features will be more fully understood from the following detailed description of exemplary embodiments with reference to the accompanying drawings, which should be considered by way of illustration and not limitation, in which:

Figure 1a, 1 b, 1c These figures represent flow diagrams of three embodiments of the method.

Figure 2 This figure represents a diagram of a preferred example of the digital circuit.

Figure 3 This figure represents a diagram of another preferred example of the digital circuit implemented in a FGPA architecture.

PREFERRED EMBODIMENT OF THE INVENTION

Throughout the present document it will be understood that various parts of one embodiment of the invention can be freely combined with parts described in other embodiments, even being said combination not explicitly described, provided there is no harm in such combination.

The method object of the present invention is directed to the implementation of a binary segmentation computation process for two array operands comprised of narrow bit width operands by means of a digital circuit. The method is capable of computing at least vector linear convolutions, inner products, vector reductions, scalar-vector products, or vector additions or subtractions; all these linear algebra operations are computed by means of a set of common steps (10, 20, 30, 40, 50), which differ from one type of operation to another in the clustering bit width of the elements of the operands, the result extraction procedure, and the specific computation by the arithmetic-logic unit (ALU), either an addition or a multiplication.

Figures 1a, 1 b and 1c show three flow diagrams corresponding to three embodiments of the method. In general, the method may be divided into three phases:

- A pre-processing phase, comprising the steps of receiving (10) two operands u, v, composed of m, n narrow bit width elements, extending (20) the bit width of each element of the operands to a clustering bit width and feeding (30) the values U, V, to the arithmetic-logic unit.

- A compute phase, which is executed by the ALU; in this phase the values U and V are either added or multiplied, and outputted to the digital circuit.

- A post-processing phase, comprising the steps of receiving (40) a result value W and extracting (50) an operation result w.

Figure 1a shows a general case, valid for any operation, while Figure 1b shows the specific case of the computation of the inner product, and Figure 1c that of the computation of the subtraction.

In a first step, the digital circuit receives (10) two operands, u, v, for example two vectors, with m and n elements, or components, respectively: u = [u 0 , ...,u m -i], v = [v 0 , V n -1 ]

If the operation of choice is the inner product, the digital circuit needs to revert (11 ) the order of the elements of the second vector, as follows: v' = [v n -i, ..., V 0 ]

Convolutional neural networks (CNN) apply a type of linear convolution which can be also computed with the method of the present invention; this operation is calculated in the same way as the general case of linear convolution, but differs in that the order of the elements of one of the vectors is reverted (11) as in the case for the calculation of the inner product. The next step requires the digital circuit to extend (20) the bit width of each element of both vectors to a clustering width, c w . The target clustering width is achieved by adding zeroes to the left of the most significant bit, and the clustering width is defined so as to space the elements in the concatenated value (II or V), and avoid overflows in the computation. The clustering width, c w , is an integer fulfilling the condition: c w > max (b u , b v )

Where b u , b v , are the bit widths of the elements of u and v, respectively. To better exploit the benefits of the binary segmentation, the clustering width needs to be the smallest possible integer that provides enough separation between the components of the vectors, and this in turn depends on the operation. In summary, for each operation, the clustering width is the smallest integer fulfilling the condition:

- For linear convolution, inner product and vector reduction: c w > b u + b v + log 2 (max(m, n))

- For vector-scalar product: c w > b u + log 2 (m)

- For addition or subtraction: c w > max (b u , b v ) + 1

It is apparent that certain operations require the same number of elements for both operands, or vectors; in such cases, it follows that m = n. Also, in the case of the scalarvector product, the scalar operand is considered as an array with a single element with the value of the scalar, k, and the rest of the process does not change.

Moreover, if the operation of choice is a subtraction, the digital circuit computes (21) the two’s complement of the elements of the subtracted vector, in this case, v, after expanding (20) to the clustering width.

After expanding the elements to a normalized clustering width, the digital circuit concatenates the elements to obtain two values II, V, of bit widths c w m and c w n, respectively. When the operation is a subtraction or an inner product, the second vector v is transformed as explained above, and then expanded and concatenated as in the general case. The two values II, V, are then fed (30) to the ALU, possibly together with a signal indicating the type of operation, namely a multiplication or addition, and then returned to the digital circuit, which receives (40) the result value W. The type of operation is a multiplication, except for subtractions and additions, in which case, the values U, V, are added.

The result value W is a binary integer of p bits; the size of p depends on whether U and V are added or multiplied and the bit widths of U and V. If W is written as an array of bits:

W = [Wo, ... , ^-1] the extraction (50) step requires the digital circuit to set apart or somehow output a subset of bits of W. The specific set of bits will depend on the type of operation, and is summarized as follows:

For linear convolution the result of the operation w is an array or vector, and each element Wj of the array, or slice, is defined as:

A similar principle applies to the scalar vector product, the addition and subtraction, with the difference of the number of components of w:

Finally, for inner product and vector reduction the result of the operation comprises a single slice of the result value W:

The following numerical examples shows two implementations of the method with specific numerical values.

Example 1 : inner product

Operands: u = (7,5); v = (4,2)

Clustering width: c w = 7

7 = 111 2 ; 5 = 101 2

4 = 100 2 ; (vi , v 0 ) = (010, 100)

Ucw Vcw = (0000111 , 0000101)^ V = [00001110000101]

W = U V = [000111001001100010100]

W = [W(2 cw) - 1, Wl cw] = [W , W 7 ] = [0100110]

Regarding the digital circuit, in a preferred example, shown in Figure 2, the digital circuit is implemented as a RISC (Reduced Instruction Set Computer) instruction set architecture integrated on a system on a chip (SoC), for a conventional central processing unit (CPU). For example, the RISC-V vector extension v1.0 has deprecated its support for narrow single instruction multiple data (SIMD) computations, and its initial specifications only accounted for 8-bit, 4-bit, 2-bit, and 1-bit data types. As opposed to standard SIMD units, the digital circuit allows many data sizes to be kept compressed in memory, and computed in a SIMD fashion, without incurring data manipulation related area overheads. The digital circuit also features mixed-precision computation support by design, as the clustering widths already account for different data sizes between the data sources. Thus, the digital circuit is capable of computing compressed data and perform flexible SIMD-style computations, whose width is proportional to the data size of every operation operand, without associated overheads. Moreover, implementing the digital circuit, whose key novelty relies on hardware reutilization, does not require any additional data path, or a separate register file or functional unit, leading to a negligible area and power overheads.

The digital circuit relies on standard arithmetic-logic units (e.g., integer multipliers), whose data paths and implementations are already implemented in processors supporting integer computations. Thus, the principal aim of the digital circuit is to efficiently cluster data before the multiplication, and to optimize the data extraction on the multiplier output side. The digital circuit tackles these problems by means of two units, called pre-processing and postprocessing. The pre-processing unit functionality is twofold. The extend unit converts the elements belonging to the operands into the input values II, V, and forwards them to the processor multiplier through output busses to perform the actual computation. The number of elements to be clustered, as well as the bit widths, are specified in the control register. Then, the multiplication or addition result is processed by the mask unit, that composes the final result depending on the type of operation. Specifically, the mask unit extracts a set of bits w of value W according to their position, as defined by the binary segmentation technique.

To speed-up the data compression phase, the pre-processing unit comprises a pack unit, which compresses its input data into their actual data sizes. The pack unit converts the operand elements into the target output bit width, and forwards them to the mask unit for merging. Therefore, the post-processing unit acts either as a filter to extract the meaningful slice of data from the arithmetic-logic unit output, or it is used to compress data, wherein its behaviour depends on the type of operation, as well as on the values set in the control register.

In other embodiments, the digital circuit is applied to a data flow computer architecture; in the example shown in Figure 3, the digital circuit is implemented on a FPGA architecture. FPGA’s are composed by data flow blocks, known as CLB’s, each block comprising logic circuits such as, ALU’s, memory units and/or LUT’s, which are a hardware component specialized in implementing logic units such shifts, Boolean operations, etc.; the CLB’s are interconnected through an interconnection matrix, and are capable of receiving and sending data to a neighbouring block.

In such example, three CLB’s have been connected as a chain, and the first CLB of the chain (CLBO) comprises two memory units that output two new vectors of elements on every clock cycle. The central block of the chain (CLB1), uses an internal ALU as a multiplier, and its LUT’s as logic circuits to implement the pre-processing and the post-processing units of the digital circuit, in other examples, the ALU is an external implemented with a DSP. Thus, the two source vectors from the CLBO are processed by the extend unit of the preprocessing unit to obtain the input clusters, multiplied by the ALU to obtain W, and processed by the mask unit of the post-processing unit to extract one operation result on every cycle. Each result is then stored in a store buffer on a third block (CLB2).