


Title:
LOGIC CELL ARCHITECTURE
Document Type and Number:
WIPO Patent Application WO/2021/163767
Kind Code:
A1
Abstract:
A logic cell architecture for a programmable integrated circuit, comprising: a lookup table having two or more inputs; and an XOR gate having the same number of inputs, the XOR gate connected in a datapath upstream of the lookup table.

Inventors:
RASOULINEZHAD SEYEDRAMIN (AU)
SIDDHARTHA (AU)
BOLAND DAVID (AU)
LEONG PHILIP (AU)
Application Number:
PCT/AU2021/050148
Publication Date:
August 26, 2021
Filing Date:
February 22, 2021
Assignee:
UNIV SYDNEY (AU)
International Classes:
H03K19/17736; G06F9/22; G06F9/30; H03K19/17728
Foreign References:
US20090243652A12009-10-01
US8680886B12014-03-25
US7154299B22006-12-26
Other References:
YUAN YUELAI; TU LE; HUANG KAN; ZHANG XIAOQIANG; ZHANG TIEJUN; CHEN DIHU; WANG ZIXIN: "Area Optimized Synthesis of Compressor Trees on Xilinx FPGAs Using Generalized Parallel Counters", IEEE ACCESS, vol. 7, 18 September 2019 (2019-09-18), pages 134815 - 134827, XP011747984, DOI: 10.1109/ACCESS.2019.2941985
SIMONS TAYLOR, LEE DAH-JYE: "A Review of Binarized Neural Networks", ELECTRONICS, vol. 8, 661, 12 June 2019 (2019-06-12), pages 1 - 25, XP055783520, ISSN: 2079-9292, DOI: 10.3390/electronics8060661
Attorney, Agent or Firm:
FPA PATENT ATTORNEYS PTY LTD (AU)
Claims:
CLAIMS

1. A logic cell architecture for a programmable integrated circuit, comprising: a lookup table having two or more inputs; and an XOR gate having the same number of inputs as the lookup table, the XOR gate connected in a datapath upstream of the lookup table.

2. A logic cell architecture according to claim 1, wherein the lookup table is configured to implement a sum operation and the XOR gate is configured to implement a carry operation.

3. A logic cell architecture according to claim 2, wherein the sum operation and the carry operation are performed in a generalised parallel counter.

4. A logic cell architecture according to claim 3, wherein the generalised parallel counter is configured to implement a binarized neural network.

5. A logic cell architecture according to any preceding claim, wherein the outputs of the lookup table and the XOR gate are connected to a common multiplexer.

6. A logic cell architecture according to claim 5, further including a second XOR gate interposed between the XOR gate and the multiplexer.

7. A logic cell architecture according to claim 5, further including an adder interposed between the XOR gate and the multiplexer.

8. A logic cell architecture according to any preceding claim, wherein the XOR gate is a 6-input XOR gate.

9. A logic cell architecture according to claim 8 when dependent on claim 3, wherein the generalised parallel counter is C6:111.

10. A logic cell architecture according to any preceding claim, wherein the programmable integrated circuit is a field programmable gate array.

11. A method for configuring a programmable integrated circuit, the method comprising defining a logic cell in the programmable integrated circuit, the logic cell comprising: a lookup table having two or more inputs; and an XOR gate having the same number of inputs as the lookup table, the XOR gate connected in a datapath upstream of the lookup table.

12. A method according to claim 11, wherein the lookup table is configured to implement a sum operation and the XOR gate is configured to implement a carry operation.

13. A method according to claim 12, further including defining a generalised parallel counter in the programmable integrated circuit, wherein the sum operation and the carry operation are performed in the generalised parallel counter.

14. A method according to claim 13, wherein the generalised parallel counter is configured to implement a binarized neural network.

15. A method according to any one of claims 11 to 14, wherein the logic cell is defined such that the outputs of the lookup table and the XOR gate are connected to a common multiplexer.

16. A method according to claim 15, further including defining a second XOR gate interposed between the XOR gate and the multiplexer.

17. A method according to claim 15, further including defining an adder interposed between the XOR gate and the multiplexer.

18. A method according to any one of claims 11 to 17, wherein the XOR gate is a 6-input XOR gate.

19. A method according to claim 18 when dependent on claim 13, wherein the generalised parallel counter is C6:111.

20. A method according to any one of claims 11 to 19, wherein the programmable integrated circuit is a field programmable gate array.

Description:
Logic Cell Architecture

Field of the invention

The present invention relates generally to a logic cell architecture for a programmable integrated circuit device. The present invention relates specifically, but not exclusively, to a logic cell architecture for a field-programmable gate array (FPGA) that is optimised for the efficient implementation of generalised parallel counters and binarized neural networks.

Background of the invention

The flexibility of modern FPGA architectures allows arbitrary arithmetic circuits to be implemented efficiently through the presence of digital signal processing (DSP) blocks for high-precision (i.e. 18-bit) multiply-accumulate operations, and lookup tables (LUTs) linked with carry-chains for bit-level operations. Generalised parallel counters (GPCs) and compressor trees are examples of digital circuits that can be implemented on a modern FPGA. GPCs and compressor trees are also the building blocks for binarized neural networks, where they are used to perform operations such as popcounts (i.e. counting the number of set bits of a wide input).

However, the very flexibility of modern FPGA architectures can make it difficult to implement desired operations and verify their performance. This is particularly the case with certain lower-precision and multi-operand operations that need to be integrated with higher-precision operations. The present invention aims to provide an alternative logic cell architecture.

Reference to any prior art in the specification is not an acknowledgment or suggestion that this prior art forms part of the common general knowledge in any jurisdiction or that this prior art could reasonably be expected to be understood, regarded as relevant, and/or combined with other pieces of prior art by a skilled person in the art.

Summary of the invention

According to a first aspect of the present invention there is provided a logic cell architecture for a programmable integrated circuit, comprising: a lookup table having two or more inputs; and an XOR gate having the same number of inputs as the lookup table, the XOR gate connected in a datapath upstream of the lookup table.

The present invention provides a logic cell architecture for a programmable integrated circuit (such as an FPGA) that includes an explicit XOR datapath. This enables certain GPC operations to be implemented using fewer logical units. In particular, the C6:111 operation can be implemented in an FPGA according to the invention using only two lookup tables. In turn, this enables better resource utilisation on the FPGA.

According to some embodiments, the lookup table is configured to implement a sum operation and the XOR gate is configured to implement a carry operation. These sum and carry operations can be performed within a generalised parallel counter.

In turn, the generalised parallel counter can be configured to implement a binarized neural network.

According to some embodiments, the outputs of the lookup table and the XOR gate are connected to a common multiplexer. The logic cell architecture can further include a second XOR gate interposed between the XOR gate and the multiplexer. Alternatively, the logic cell architecture can further include an adder interposed between the XOR gate and the multiplexer.

Typically the XOR gate is a 6-input XOR gate, but other types of XOR gates can also be used.

According to a second aspect of the present invention, there is provided a method for configuring a programmable integrated circuit, the method comprising defining a logic cell in the programmable integrated circuit, the logic cell comprising: a lookup table having two or more inputs; and an XOR gate having the same number of inputs as the lookup table, the XOR gate connected in a datapath upstream of the lookup table.

Preferably, the lookup table is configured to implement a sum operation and the XOR gate is configured to implement a carry operation. The method may further include defining a generalised parallel counter in the programmable integrated circuit, wherein the sum operation and the carry operation are performed in the generalised parallel counter.

According to some embodiments, the generalised parallel counter is configured to implement a binarized neural network. Preferably, the logic cell is defined such that the outputs of the lookup table and the XOR gate are connected to a common multiplexer.

Optionally, the method further includes defining a second XOR gate interposed between the XOR gate and the multiplexer.

Preferably, the method further includes defining an adder interposed between the XOR gate and the multiplexer.

The XOR gate may be a 6-input XOR gate.

According to some embodiments, the generalised parallel counter is C6:111.

Preferably, the programmable integrated circuit is a field programmable gate array.

As used herein, except where the context requires otherwise, the term "comprise" and variations of the term, such as "comprising", "comprises" and "comprised", are not intended to exclude further additives, components, integers or steps.

Brief description of the drawings

Further aspects of the present invention and further embodiments of the aspects described in the preceding paragraphs will become apparent from the following description, given by way of example and with reference to the accompanying drawings, in which:

Figure 1 shows block diagrams illustrating the use of parallel FAs to implement a single stage of carry-save addition for a 3b 3-operand addition;

Figure 2 shows block diagrams illustrating the operation of compressors;

Figure 3 shows block diagrams of adder trees;

Figure 4 shows the dot notation of three different GPCs;

Figure 5 shows circuit diagrams of two embodiments of logic cell architectures according to the invention;

Figure 6 is a histogram of the percentage count and cost (in LUTs) for a number of selected GPCs;

Figure 7 is a schematic illustration of a logic cell architecture according to an embodiment of the invention;

Figures 8 and 9 are graphs illustrating the resource reduction for logic cell architectures according to the invention; and

Figure 10 is a graph illustrating the performance of the logic cell architectures according to the invention in implementing binarized neural networks.

Detailed description of the embodiments

Embodiments of the present invention provide a logic cell architecture that optimises the implementation of GPCs and compressor trees. In this regard, parallel counters are digital circuits that simply count the number of asserted bits in the input, returning this value as a binary output. They can be specified in (p:q) notation, where p is the number of input bits, and q is the number of output bits used to express the result in binary notation. Half-adders (HA) and full-adders (FA) are commonly used parallel counters, denoted as (2:2) and (3:2) respectively. Parallel counters can also be expressed in dot notation, as shown for the full-adder in Figure 1(a). Figure 1(b) shows how FAs can be used in parallel to implement a single stage of carry-save addition for a 3b 3-operand addition. Each FA takes inputs from a single column, and hence all input bits to a parallel counter have the same rank, i.e. they all have the same weight.
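By way of an illustrative sketch (not part of the claimed circuitry), the (p:q) counting behaviour and the (3:2) full-adder described above can be modelled in a few lines of Python; the function names are chosen for illustration only:

```python
# A (p:q) parallel counter simply counts the asserted bits of its p inputs
# and returns the count as a q-bit binary number.
def parallel_counter(bits):
    """Count set bits; returns (count, minimum output width q)."""
    count = sum(bits)
    q = max(1, len(bin(len(bits))) - 2)  # bits needed to express 0..p
    return count, q

# A full adder is the (3:2) counter: three rank-r inputs produce a sum bit
# of rank r and a carry bit of rank r+1.
def full_adder(a, b, c):
    total = a + b + c
    return total & 1, (total >> 1) & 1  # (sum, carry)

count, q = parallel_counter([1, 0, 1, 1])   # three set bits, q = 3
s, cout = full_adder(1, 1, 1)               # 1+1+1 = 3 -> sum=1, carry=1
```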

Compressors can be thought of as parallel counters, but with explicit carry-in (Cin) and carry-out (Cout) bits that can be connected to adjacent compressors in the same stage, as shown in Figure 2(b). Unlike carry-propagate adders, the carry-chains between compressors are not connected, and hence the overall delay of the circuit scales much better.

For multi-operand addition, adder trees can be built by chaining multiple ripple-carry adders (RCA). Figure 3(a) shows addition of 3 × 3b operands. The carry-out from each FA/HA propagates to the next FA, which results in a long critical path along the carry-chain (shown in red or with dashed arrows). While the RCA has a small area footprint, this long delay is undesirable and can limit performance, especially for operands with large bit-width.

Figure 3(b) shows a “carry-save adder” (CSA) that addresses this issue by breaking the carry-chain. The critical path is shown in red or with dashed arrows. The full adders can operate without having to wait for the carry input to be available, and the output is in a redundant number format involving two bits per output digit. A higher level of parallelism than the RCA is achieved, but the output is not in binary format and must be reduced to the final answer using an RCA stage. Nevertheless, the CSA reduces the overall delay of the addition. In Figure 3, the critical path delay has one less full-adder delay. This approach of breaking the carry-chain dependency up until the final RCA stage is the basis behind compressor trees. A compressor tree is a circuit that takes in a set of binary values (or dots) that represent multiple operands, and outputs the result as a sum and carry. Stage 0 in Figure 1(b) is a compressor tree that produces sum and carry bits as inputs into Stage 1, which are then evaluated by an RCA to produce the final result. In this regard, the HA FA HA row in Figure 3(b) is the RCA stage. Compressor trees can be built using parallel counters, compressors, or both.
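A single carry-save stage of the kind just described can be sketched as follows (an illustrative Python model, not the claimed hardware; the function name is an assumption). Each column is reduced independently, so no carry propagates between columns, and the redundant (sum, carry) pair is only later collapsed by the final RCA stage:

```python
# One carry-save stage: reduce three N-bit operands to a (sum, carry) pair
# using an independent full adder per column -- no carry chain between columns.
def csa_stage(x, y, z, n):
    sum_word, carry_word = 0, 0
    for i in range(n):
        a, b, c = (x >> i) & 1, (y >> i) & 1, (z >> i) & 1
        t = a + b + c
        sum_word |= (t & 1) << i
        carry_word |= ((t >> 1) & 1) << (i + 1)  # carry bit has rank i+1
    return sum_word, carry_word

# The redundant pair still sums to the true result; a single final
# carry-propagate addition (the RCA stage) recovers the binary answer.
s, c = csa_stage(5, 6, 7, 3)
assert s + c == 5 + 6 + 7
```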

Generalized Parallel Counters map efficiently to FPGAs. Unlike parallel counters, GPCs allow input bits to have different weights, which, in the dot notation, makes GPCs appear as multi-column counters. Figure 4 shows the dot notation of three different GPCs. Mathematically, GPCs are written as a tuple (pn-1, ..., p1, p0 : q0, q1, ...), where pi is the number of input bits in the ith column, and qj is the number of output bits in the jth column. FPGA implementations of GPCs can be classified as LUT-based GPCs or carry-chain-based GPCs. The “shape” of a GPC can have a significant impact on its hardware implementation on FPGAs, and subsequently its performance and efficiency in a compressor tree.
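The column-weighted counting behaviour of a GPC can be modelled with a short illustrative sketch (Python used for illustration only; the function name is an assumption, and output-bit fan-out is abstracted into a single weighted sum):

```python
# Evaluate a GPC: columns[i] holds the input bits of rank i (weight 2**i).
# The GPC emits the weighted sum, whose binary digits fan out to the
# output columns q0, q1, ...
def gpc(columns):
    return sum((2 ** i) * sum(bits) for i, bits in enumerate(columns))

# C6:111, i.e. (6:1,1,1): six rank-0 inputs, three single-bit outputs.
assert gpc([[1, 1, 0, 1, 1, 1]]) == 5

# C25:121, i.e. (2,5:1,2,1): five rank-0 bits and two rank-1 bits.
assert gpc([[1, 1, 1, 0, 1], [1, 0]]) == 4 + 2
```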

Popular metrics used to quantify the efficiency of a GPC are defined in terms of p and q, the number of input and output bits to/from the GPC respectively; k, the area utilization (in LUTs) of the GPC; and d, the critical path delay (in nanoseconds) of the GPC implementation. In this specification, adders/compressors/GPCs are annotated with a simplified notation omitting commas. For example, the GPC (6:1,1,1) is annotated as C6:111, the (4:2) compressor as C4:2, and the full adder (3:1,1) as C3:11.

FPGA Logic Elements

The logic cell architecture according to embodiments of the present invention can be suitably implemented on a variety of FPGAs, including the Xilinx® configurable logic block (CLB) and the Intel® FPGA.

The Xilinx® CLB is composed of two slices, which are the basic units of the FPGA’s soft-fabric. Figure 5(a) (in black) shows the slice architecture found in the modern Xilinx® UltraScale+ FPGA, in which each slice is composed of four 6-input LUTs 50 (one shown in Figure 5(a)) and additional circuitry, such as registers 52 and multiplexers 54, that gives the slice its expressiveness. Another notable feature of the slice is the presence of a fast carry-chain between the LUTs, which is often used to implement arithmetic circuits such as RCAs. The logic cell architecture according to an embodiment of the present invention is implemented at the slice level.

The main logic element in Intel® FPGAs is the adaptive logic module (ALM). Figure 5(b) (in black) shows the ALM architecture of a modern Stratix-10 device. Each ALM is composed of four 4-input LUTs 51, while primitives such as full-adders 53 and multiplexers 55 assist in supporting higher-order Boolean functions. Ten ALMs on Intel® FPGAs are grouped to form a logic array block (LAB), which augments the ALMs with more primitives such as HyperFlex registers, local interconnect, and fracturable carry-chains. The logic cell architecture according to another embodiment of the present invention is implemented at the ALM level.

As shown in Figure 5(a), a 6-input XOR gate (XOR6) 56 is connected upstream of the input of 6-input LUT 50. A corresponding modification is made in Figure 5(b), where a 6-input XOR gate 57 is upstream of LUT 51. Incorporating an XOR gate in this way exploits the inventors’ findings that the C6:111 GPC is dominant in modern FPGA-based compressor tree designs. To quantify the finding, analysis was performed of optimal solutions of compressor trees from a set of 50+ benchmarks extracted from various domains (including popcounting, multi-operand addition and FIR filters).

In this regard, Figure 6 is a histogram of the percentage count and cost (in LUTs) for all GPCs across all solutions. Due to its compression efficiency, C6:111 is used more than a third of the time, and as a result, most of the hardware is dedicated towards its implementation. In modern FPGAs, the C6:111 maps to 3 LUTs. However, the present invention, in providing an explicit XOR6 datapath inside each logic cell, allows the operation to be performed by just two LUTs. At least in theory, the present invention can deliver a reduction in resource utilization of the most commonly-used GPC by 33%.
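The two-LUT mapping rests on a simple Boolean fact: the least-significant output bit of a 6-input count is just the parity (XOR) of the inputs, so the explicit XOR6 gate can produce it directly, leaving the remaining output bits to the LUTs. This identity can be checked exhaustively with a short sketch (illustrative only):

```python
from itertools import product

# The LSB of a C6:111 count equals the parity of the six inputs, which is
# why an explicit XOR6 datapath can take over that output bit.
for bits in product([0, 1], repeat=6):
    count = sum(bits)
    parity = 0
    for b in bits:
        parity ^= b
    assert (count & 1) == parity
```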

Another advantage of the present invention is its applicability to binarized neural networks (BNNs). In BNNs, the core computational workload is generated by the convolution layers, which reduce to an XnorPopcount operation in the binarized format. In this regard, consider the XnorPopcount operation for 3 binary weights (w0, w1 and w2) and 3 corresponding binary activations (x0, x1 and x2). The computation required is: P = (w0 ⊙ x0) + (w1 ⊙ x1) + (w2 ⊙ x2), where ⊙ represents an XNOR operation and ⊕ represents an XOR operation.

The XnorPopcount instruction is synthesized as a fused instruction that is mapped to 2 LUTs on modern FPGAs, as shown in Figure 7(a): one LUT to compute the sum bit, and the other to compute the carry bit. According to the invention, however, this computation can be mapped to just a single logic element due to a subtle Boolean transformation, where the Sum bit (S) can now be expressed as the complement of the XOR of all six raw inputs: S = NOT(w0 ⊕ x0 ⊕ w1 ⊕ x1 ⊕ w2 ⊕ x2).

The LUT-6 implements the carry logic in this case, and both outputs from a single Xilinx slice can now be used productively to compute the partial products of the binarized convolution layer (see Figure 7(b)). Finally, to compute the output activations of the convolution layer, all the partial sums have to be summed, which can be visualized as a tall two-column many-operand instruction of carry and sum bits, as shown in Figure 7(c). This can be efficiently reduced using a compressor tree, which is also improved by the cell architecture according to the present invention.
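The XnorPopcount computation and the sum-bit transformation it relies on can be verified exhaustively with a short illustrative sketch (Python; function names are assumptions). Since each XNOR term flips the parity once and there are three terms, the sum bit equals the complement of the XOR of all six raw inputs:

```python
from itertools import product

def xnor(a, b):
    return 1 - (a ^ b)

# XnorPopcount for three weight/activation pairs.
def xnor_popcount(w, x):
    return sum(xnor(wi, xi) for wi, xi in zip(w, x))

for w0, w1, w2, x0, x1, x2 in product([0, 1], repeat=6):
    p = xnor_popcount((w0, w1, w2), (x0, x1, x2))
    s = p & 1
    # Sum bit == complement of the XOR of all six raw inputs (an XNOR6),
    # which is what lets the explicit XOR6 datapath absorb it.
    assert s == 1 - (w0 ^ x0 ^ w1 ^ x1 ^ w2 ^ x2)
```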

In another aspect, the present invention involves certain reconfigured datapaths for the FPGAs respectively shown in Figures 5(a) and 5(b). The blue (or dotted) datapath in Figure 5(a) is a modified carry-chain datapath, with additional logic allowing the output from the XOR6 gate 56 to be propagated into the carry-chain, which improves the implementation of carry-chain-based GPCs (especially “atomic” GPCs). For example, an atomic GPC can be implemented with a single LUT, instead of two LUTs. This enables the development of new atomic GPCs, such as C06060606:111111111, which can be mapped to just a single slice. This particular atom has a very high compression efficiency of 3.75.

Table 1 summarizes the characteristics of the atomic GPCs that are improved by the datapath according to the present invention on Xilinx® FPGAs. The datapath is labelled “LUXOR+” in the table.

Table 1: Slice-based atomic GPCs for Xilinx FPGAs

In the Intel® FPGA illustrated in Figure 5(b), a majority circuit and full-adder have been added to the ALM datapath, as shown in blue (or dotted) in Figure 5(b). This modification directly impacts the C25:121 GPC, which can now be captured in a single ALM instead of two ALMs. Note that in Figure 6, the C25:111 GPC is also a very efficient GPC found often in compressor tree solutions of various benchmarks.

Compressor tree problems can be suitably mapped onto the logic cells according to the present invention. There are typically two main algorithmic methods for building compressor trees, namely a heuristics-based search and an integer linear programming (ILP) model solver. The latter approach was used in the present invention to build an ILP model, with Table 2 summarizing the list of variables.

Table 2: Variables used in the ILP model

The goal is to minimize the overall cost of implementing a compressor tree. This corresponds to the total number of compressors of each type multiplied by their cost, for each stage of the compressor tree, as described by Equation (5).

The minimum number of stages required, St, varies from benchmark to benchmark. A previous approach modelled the maximum number of stages as a heuristic and added it to the cost function. However, this approach was observed to be slow for difficult problems, and in some cases the solver returned a solution that took more stages than required. Instead, an iterative approach was taken, where the solver was queried to find its best solution within a fixed maximum stage limit, Smax, which was increased one at a time until a feasible solution was found. This method also minimizes delay, which is introduced by each additional stage. In practice, the solver was able to determine infeasibility within a few seconds, whilst being able to find a feasible integer solution within a few minutes.
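The iterative stage-limit search described above can be sketched as follows (an illustrative Python outline, not the actual ILP implementation; `solve` is a hypothetical stand-in for the solver, returning a solution when a compressor tree fits within the given stage budget or None when the model is infeasible):

```python
# Grow the stage budget one at a time; the first feasible budget is the
# minimum number of stages, which also minimises the per-stage delay.
def find_min_stage_solution(solve, s_max_start=1, s_max_limit=10):
    s_max = s_max_start
    while s_max <= s_max_limit:
        solution = solve(s_max)
        if solution is not None:
            return s_max, solution
        s_max += 1
    raise RuntimeError("no feasible solution within the stage limit")

# Toy stand-in: pretend the benchmark needs at least 3 stages.
stages, cost = find_min_stage_solution(lambda s: 42 if s >= 3 else None)
assert (stages, cost) == (3, 42)
```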

In each stage, the number of carry-bits in each column is computed in (4.3), where the division by two is due to the increase in the column’s radix.

This can be formulated as an ILP constraint as follows:

Note that the number of input carry-bits into the first column is always set to 0. When solving the model iteratively, as described above, the constraints on the final stage guide the solver to converge to the solution. A “ragged” carry-propagate architecture was used for the final accumulation stage for Xilinx FPGAs. Unlike previous approaches using a heuristic solver, the ragged carry-propagate adder was modelled for the final stage as three constraints, for c = 0, 1, 2, ..., C−1 and s = St − 1.

Finally, since Intel FPGAs cannot benefit from the ragged carry-propagate adder, the ILP constraints for the final stage were modelled separately for Intel FPGAs.

The effectiveness of the logic cell architectures according to the invention was studied by using the ILP to optimise compressor trees. The chosen compressor sets used to construct the trees are summarized in Table 3.

Table 3: GPCs / compressors used in this paper

When targeting Xilinx architectures, the cost of C6:111 GPCs was reduced from 3 to 2 logic elements, as described above. For the datapath modifications, the slice-based GPCs described in Table 1 were added to the ILP. These results are denoted as X-LUXOR and X-LUXOR+ respectively.

Table 4 shows the efficiency and compression metrics of different GPCs used to target Intel architectures. These are separated into two main groups: LUT-based and arithmetic-based. According to the APD (Equation 3) metric (noted above), which measures the efficiency of a GPC taking into account delay and resource usage, LUT-based GPCs such as C6:111, C15:111 and C23:111 are the most efficient GPCs for Intel FPGAs. Although some of the proposed GPCs offer better compression rates (S), such as C7:111, their reported delay is 3.5× greater.

Table 4: Comparison of different GPCs proposed in [22] and new GPCs supported by LUXOR and LUXOR+

[The body of Table 4, comparing the delay, LUT count and APD of the LUT-based and arithmetic-based GPCs (including C6:111, C15:111, C7:111 and C42:11111), is not legible in the source text. Among the compared GPCs, C25:111 was highlighted as offering favourable timing and lower logic utilisation.]

To evaluate the improvements of the architectures according to the invention, different benchmarks from three categories were used, namely: 1) low-rank inputs, including popcount and two-column count; 2) high-rank inputs, including multi-addition, the 3-MAC operation (described below) and a randomly generated FIR-3 filter; and 3) the BNN XnorPopcount operation for various input sizes. These three benchmark categories highlight the benefits of the architectures according to the invention.

The 3-MAC operation was modelled according to the following equation: y = (a0 × b0) + (a1 × b1) + (a2 × b2), where each of the three operand pairs is N bits wide.

It may be noted that since there are 3 pairs of inputs, instead of computing partial products and summing their results, partial products of the same rank can be selected and a primary compression performed. The cost of this step is included in the relevant result. The resulting tree forms the input to the compressor. This was repeated for different bitwidths (N).
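The rank-grouping step described above can be sketched as follows (an illustrative Python model; the function name is an assumption). Partial-product bits of the same rank from all three products are gathered into columns, which then form the input to the compressor tree:

```python
# 3-MAC: y = a0*b0 + a1*b1 + a2*b2. Rather than summing three finished
# products, partial-product bits of equal rank from all three products
# are gathered into columns and compressed together.
def mac3_columns(pairs, n):
    columns = {}
    for a, b in pairs:
        for i in range(n):          # bit i of a
            for j in range(n):      # bit j of b
                bit = ((a >> i) & 1) & ((b >> j) & 1)
                columns.setdefault(i + j, []).append(bit)
    return columns

pairs = [(3, 5), (2, 7), (6, 1)]    # N = 3
cols = mac3_columns(pairs, 3)
# Reducing the columns (here by direct weighted sum) recovers the 3-MAC.
result = sum((2 ** rank) * sum(bits) for rank, bits in cols.items())
assert result == 3 * 5 + 2 * 7 + 6 * 1
```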

The Intel Stratix-10 ALM unit, and Xilinx UltraScale slice were modelled according to their respective data sheet descriptions. Post synthesis results were reported and the synthesis strategy was set to “Timing Optimization” since it usually leads to a better Area x Delay product.

Table 5 provides the post-synthesis area and timing results for the Intel baseline, I-LUXOR, Intel+MajFA and I-LUXOR+ modifications to the ALM.

Table 5: ASIC results for the Intel Stratix-10 ALM architecture

From the table, it can be seen that the area increase of I-LUXOR is about 1% while the delay increase is less than 0.5%. This demonstrates that there is little overhead associated with adding a 6-input XOR gate to the ALM unit. In contrast, adding MajFA circuits will increase the area and delay by 2% and 5% respectively (see above). The full I-LUXOR+ implementation has 3% and 5% delay and area overhead respectively. This somewhat larger increase in area compared to the individual effect of each modification arises from the performance-driven synthesis optimization. For measuring the critical path, the multiplexers connecting the ALM’s outputs to its input were removed.

The synthesised Xilinx baseline model slice has an area of 6045 μm². The reported critical path was compared with that from the Virtex-5 datasheet, a device that is also manufactured with a similar 65 nm process. The critical path from an input, through 4 carry circuits, to the output (TITO) was 0.67, 0.77 or 0.90 ns for three different speed grades. Comparing these values with the achieved 0.84 ns from Table 6, consistency with the synthesis results was verified. The same table shows that X-LUXOR has similar area utilization and a 6% increase in delay, while X-LUXOR+ has 6% area overhead and an increase of 9% in delay.

Table 6: ASIC results for the Xilinx UltraScale slice architecture

The effect of the ILP approach on resource utilization in logic elements is affected by the choice of primitives in the primary stage (if applicable), the compression tree stages, and the last stage (the final ternary adder in Intel architectures, or the equivalent relaxed ternary adder for Xilinx architectures). Table 7 compares the architectures of the present invention with an earlier approach.

Table 7: A comparison of solutions from the ILP solver with those in [22]

1 Heuristics used in [22]: Efficiency/Strength/Product, reported in that order.

2 bM = benchmark, LE = logic elements (LUTs), St = minimum compressor tree stages

As can be seen in the baseline column, the ILP approach uses fewer logic elements (LEs) and stages (St) for all benchmark problems compared with the heuristic approach, since an optimal solution is found. While the X-LUXOR enhancement significantly reduces the number of LEs compared with the baseline, LUXOR+ achieves a further reduction in the number of stages. Figure 8 shows the savings in LEs for Xilinx architectures over a larger benchmark set, with the red star also indicating a reduction in the number of stages by one. For low-rank inputs (i.e. popcount and two-column popcount), the C6:111 and C25:121 compressors are heavily used. X-LUXOR improves the resource efficiency of C6:111 implementations and achieves the best savings for the 1024-input popcount problem at a 22% reduction. Less improvement is seen for two-column popcount as, in the first stage, C25:121 has better arithmetic slack (A) while offering the same efficiency. X-LUXOR+ offers a new set of state-of-the-art compression rates and compression efficiencies. On average, X-LUXOR+ can reduce area utilization on the low-rank input popcount and two-column popcount benchmarks by 22% and 15% respectively.

For the high-rank benchmarks (multi-operand addition, FIR-3 and 3-MAC), the inputs are wide enough to benefit from the slice-based GPCs. The C6:111 compressor is not significantly utilized. However, X-LUXOR+ offers higher compression rates and hence achieves 30% and 18% improvements in the 3-MAC and multi-addition benchmarks respectively, and in some cases the required number of stages is also reduced by one.

Figure 9 shows the results of the ILP model for the Intel baseline, I-LUXOR and I-LUXOR+ architectures. More dramatic resource savings are apparent than for Xilinx, particularly for low-rank problems using I-LUXOR+. Since I-LUXOR and I-LUXOR+ do not introduce new compressors, no reduction in the number of stages is achieved. However, because the baseline offers no wide GPC, the resource reduction of I-LUXOR is more significant (averaging 24% and 17% for popcount and double popcount). I-LUXOR+ offers an enhanced C25:121 GPC, which is the most efficient GPC for the Intel architecture. This leads to 35% and 39% resource savings for popcounting and two-column counting.

Binarized neural networks offer a new challenge for FPGA architectures, as 1-bit multiply-accumulate operations require efficient XNOR and popcount operations. As noted above, the first computation stage (multiplication) should be merged with the early compression circuits, leading to an efficient implementation (as illustrated in Figure 7(a)). If the number of input pairs is N, N/3 fused units are required in the primary stage. LUXOR can implement this fused computation using a single LE rather than the two LEs of the baseline architectures, saving N/3 LEs. In addition, after implementation of the primary stage, a two-column counting problem with a height of N/3 is encountered.

As shown in Figure 10, these two optimizations lead to almost the same 34% resource reduction for the LUXOR modification on both Xilinx and Intel architectures. Moreover, as described before, X-LUXOR+ cannot reduce the number of LEs significantly for low-rank inputs, and hence the best area savings for BNNs plateau at 37%. However, in the case when the input size is 3x3x256, the number of stages is reduced by one, which would provide a significant improvement in delay. In comparison, I-LUXOR+ reduces the number of LEs significantly, at an average of 47%, but without reducing the number of stages.

It will be understood that the invention disclosed and defined in this specification extends to all alternative combinations of two or more of the individual features mentioned or evident from the text or drawings. All of these different combinations constitute various alternative aspects of the invention.