

Title:
DATA PATH ELEMENTS FOR IMPLEMENTATION OF COMPUTATIONAL LOGIC USING DIGITAL VLSI SYSTEMS
Document Type and Number:
WIPO Patent Application WO/2023/228213
Kind Code:
A1
Abstract:
Data path elements implemented using a plurality of logic primitives are provided. The plurality of logic primitives are connected in one or more topologies to perform an atomic operation on at least two atomic input data of a pre-defined bit-size. The atomic operation is split into a plurality of sub-atomic operations and the at least two atomic input data are split into a plurality of sub-atomic data fragments. Each of the plurality of logic primitives performs a sub-atomic operation from the plurality of sub-atomic operations on at least two sub-atomic data fragments from the plurality of sub-atomic data fragments to generate at least one partial sub-atomic output data. The at least one partial sub-atomic output data is generated based on a partial arithmetic operation, a shift operation, or a logical operation performed by each of the plurality of logic primitives in each clock-cycle. The architectures of a plurality of data-path elements comprised of a plurality of logic primitives in a plurality of connection topologies are provided.

Inventors:
PANDEY KUMAR SAMBHAV (IN)
SHRIMALI HITESH (IN)
Application Number:
PCT/IN2023/050499
Publication Date:
November 30, 2023
Filing Date:
May 25, 2023
Assignee:
PANDEY UMA (IN)
International Classes:
G06F30/327
Foreign References:
US20100217960A12010-08-26
US20120259907A12012-10-11
Attorney, Agent or Firm:
SINGH, Jashandeep (IN)
Claims:
CLAIMS

I/We Claim:

1. A data path element, comprising: a plurality of logic primitives connected in one or more topologies to perform an atomic operation on at least two atomic input data of a pre-defined bit-size, wherein the atomic operation is split into a plurality of sub-atomic operations and the at least two atomic input data are split into a plurality of sub-atomic data fragments, wherein each of the plurality of logic primitives performs a sub-atomic operation from the plurality of sub-atomic operations on at least two sub-atomic data fragments from the plurality of sub-atomic data fragments to generate at least one partial sub-atomic output data, and wherein the at least one partial sub-atomic output data is generated based on a partial arithmetic operation, a shift operation, or a logical operation performed by each of the plurality of logic primitives in each clock-cycle.

2. The data-path element as claimed in claim 1, wherein the one or more topologies comprise a serial topology, a parallel topology and/or a cascade topology.

3. The data-path element as claimed in claim 2, wherein the plurality of logic primitives are connected in a serial topology by establishing at least one serial data path to output at least one partial sub-atomic output data from at least one preceding logic primitive of the plurality of logic primitives as input to at least one succeeding logic primitive of the plurality of logic primitives in a next clock-cycle.

4. The data-path element as claimed in claim 2, wherein the plurality of logic primitives are connected in a parallel topology by establishing at least two parallel data paths to output two partial sub-atomic output data each from two logic primitives from the plurality of logic primitives in parallel as input to at least two succeeding logic primitives of the plurality of logic primitives in the next clock-cycle.

5. The data-path element as claimed in claim 2, wherein the plurality of logic primitives are connected in a cascade topology by establishing at least two parallel data paths to output two partial sub-atomic output data each from at least one logic primitive from the plurality of logic primitives in parallel as input to at least one succeeding logic primitive of the plurality of logic primitives in a next clock-cycle.

6. The data-path element as claimed in claim 1, wherein each of the plurality of logic primitives is configured to perform one of a logic operation, a shift operation, a mask operation, a partial addition operation, a partial subtraction operation, a partial multiplication operation, a compare operation, a multiplexing operation, or a demultiplexing operation.

7. The data-path element as claimed in claim 1, configured as an adder data path element to perform the atomic operation of an addition on the at least two atomic data, wherein the adder data path element comprises at least two logic primitives, each performing a partial addition operation.

8. The data-path element as claimed in claim 1, configured as a subtractor data path element to perform the atomic operation of a subtraction on the at least two atomic data, wherein the subtractor data path element comprises at least two logic primitives, each performing a partial subtraction operation.

9. The data-path element as claimed in claim 1, configured as a multiplier data path element to perform the atomic operation of a multiplication on the at least two atomic data, wherein the multiplier data path element comprises at least four logic primitives, and wherein the at least four logic primitives are configured to perform a partial multiply-add operation and/or a partial addition operation.

10. The data-path element as claimed in claim 1, configured as a bitwise logic data path element to perform the atomic operation of a bitwise logic on the at least two atomic data, wherein the bitwise logic data path element comprises at least two or more logic primitives performing a logic operation.

11. The data-path element as claimed in claim 1, configured as a bi-directional shifter to perform the atomic operation of a bi-directional bit-wise shift of the at least two sub-atomic data, wherein the bi-directional shifter comprises: at least one logic primitive performing a mask operation, at least two logic primitives each performing a left shift operation, at least one logic primitive performing an OR operation, and at least one logic primitive performing a multiplexing operation.

12. The data-path element as claimed in claim 1, configured as a multiplexer data path element to perform the atomic operation of selecting one of the inputs from the at least two atomic data, wherein the multiplexer data path element comprises at least two mux logic primitives, each performing a multiplexing operation.

13. The data-path element as claimed in claim 1, configured as a demultiplexer data path element to perform the atomic operation of steering the input to the at least one of two output paths, wherein the demultiplexer data path element comprises at least two demux logic primitives, each performing a demultiplexing operation.

14. The data-path element as claimed in claim 1, configured as a comparator data path element to perform the atomic operation of compare on the at least two atomic data, wherein the comparator data path element comprises: at least two compare logic primitives, each configured to perform a compare operation, and one mask logic primitive configured to perform a mask operation.

Description:
DATA PATH ELEMENTS FOR IMPLEMENTATION OF COMPUTATIONAL LOGIC USING DIGITAL VLSI SYSTEMS

CROSS-REFERENCE INFORMATION

[0001] This application is a non-provisional of Indian provisional patent application No. 202211030038, filed May 25, 2022, entitled “IMPLEMENTATION OF DIGITAL INTEGRATED CIRCUITS ORGANIZED AS SUBSCALAR ARCHITECTURES COMPOSED OF MICRO-CELL LIBRARY BLOCKS”, which is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

[001] The instant disclosure relates to a method and system for designing an application-specific integrated circuit (ASIC) implementing computational logic in a very-large-scale integrated (VLSI) system.

BACKGROUND

[002] In semiconductor design, standard-cell methodology is used to design application-specific integrated circuits (ASICs) with mostly pre-laid-out digital-logic gates. Standard-cell methodology is an example of design abstraction, whereby a low-level very-large-scale integration (VLSI) layout is encapsulated into an abstract logic representation (such as a NAND gate). Using standard-cell methodology, ASICs have been scaled from comparatively simple single-function ICs (of several thousand gates) to complex multi-million-gate system-on-a-chip (SoC) devices, which are used in personal computers, graphics cards, digital cameras, smart devices, etc. These computing structures implement computational logic, which is realized using a number of transistors. Their processing throughput depends greatly on the computational logic and the number of transistors being used. Further, parallelism is one of the key mechanisms for enhancing the processing throughput. It is known in the art that the presence of data-flow dependencies adversely impacts the exploitation of such parallelism. The performance of digital systems cannot be arbitrarily enhanced merely by exploiting parallelism at data-word boundaries in the presence of such data-flow dependencies. A deeper inspection of the architectures of arithmetic computing structures reveals that neither are all the bits of the result produced simultaneously, nor are all the bits of the operands consumed simultaneously, in any logical operation. Further, some implementations in the prior art operate on operands with less precision in order to be faster and to consume less silicon resources, compromising on data width in one way or another.

SUBSTITUTE SHEET (RULE 26)

[003] In the prior art, ASICs are designed using EDA tools, and the resulting circuits are composed of logic primitives chosen either from standard-cell libraries or from pre-laid-out macro cells. Such primitives may also be chosen from microcell libraries, which are at a higher abstraction than that of standard cells and lower than that of macro cells.

[004] Therefore, there is a requirement to develop useful data-path elements comprising microcells which may help implement a computational methodology utilizing parallelism in a manner that is resource friendly and allows processing with higher efficiency and speed.

SUMMARY

[005] In an embodiment, a data path element implemented using a plurality of microcells is provided. The plurality of microcells may be connected in one or more topologies to perform an atomic operation on at least two atomic input data of a pre-defined bit-size. In an embodiment, the atomic operation may be split into a plurality of sub-atomic operations and the at least two atomic input data may be split into a plurality of sub-atomic data fragments. In an embodiment, each of the plurality of microcells may perform a sub-atomic operation from the plurality of sub-atomic operations on at least two sub-atomic data fragments from the plurality of sub-atomic data fragments to generate at least one partial sub-atomic output data. In an embodiment, the at least one partial sub-atomic output data may be generated based on a partial arithmetic operation, a shift operation, or a logical operation performed by each of the plurality of microcells in each clock-cycle.

[006] In an embodiment, the one or more topologies may include a serial topology, a parallel topology and/or a cascade topology. In an embodiment, the plurality of microcells may be connected in a serial topology by establishing at least one serial data path to output at least one partial sub-atomic output data from at least one preceding microcell of the plurality of microcells as input to at least one succeeding microcell of the plurality of microcells in a next clock-cycle.

[007] In an embodiment, the plurality of microcells may be connected in a parallel topology by establishing at least two parallel data paths to output two partial sub-atomic output data each from two microcells from the plurality of microcells in parallel as input to at least two succeeding microcells of the plurality of microcells in the next clock-cycle.

[008] In an embodiment, the plurality of microcells may be connected in a cascade topology by establishing at least two parallel data paths to output two partial subatomic output data each from at least one microcell from the plurality of microcells in parallel as input to at least one succeeding microcell of the plurality of microcells in a next clock-cycle.

[009] In an embodiment, each of the plurality of microcells may be configured to perform one of a logic operation, a shift operation, a mask operation, a partial addition operation, a partial subtraction operation, a partial multiplication operation, a compare operation, a multiplexing operation, or a de-multiplexing operation.

[0010] In an embodiment, the data path element may be configured as an adder data path element to perform the atomic operation of an addition on the at least two atomic data. In an embodiment, the adder data path element may include at least two microcells, each performing a partial addition operation.

[0011] In an embodiment, the data path element may be configured as a subtractor data path element to perform the atomic operation of a subtraction on the at least two atomic data. In an embodiment, the subtractor data path element may comprise at least two microcells, each performing a partial subtraction operation.

[0012] In an embodiment, the data path element may be configured as a multiplier data path element to perform the atomic operation of a multiplication on the at least two atomic data. In an embodiment, the multiplier data path element may comprise at least four microcells, performing at least a partial multiply-add operation and/or a partial addition operation.

[0013] In an embodiment, the data path element may be configured as a bitwise logic data path element to perform the atomic operation of a bitwise logic on the at least two atomic data. In an embodiment, the bitwise logic data path element may comprise at least two or more microcells performing a logic operation. In an embodiment, the logic operation may be a bit-wise operation or a binary operation.

[0014] In an embodiment, the data path element may be configured as a bi-directional shifter data path element to perform the atomic operation of a bi-directional shift of the at least two sub-atomic data. In an embodiment, the bi-directional shifter may comprise at least one microcell performing a mask operation, at least two microcells each performing a bit-wise left shift operation, at least one microcell performing an OR operation, and at least one microcell performing a multiplexing operation.

[0015] In an embodiment, the data path element may be configured as a multiplexer data path element to perform the atomic operation of selecting one of the inputs from the at least two atomic data. In an embodiment, the multiplexer data path element may comprise at least two mux microcells, each performing a multiplexing operation.

[0016] In an embodiment, the data path element may be configured as a demultiplexer data path element to perform the atomic operation of steering the input to the at least two output paths. In an embodiment, the demultiplexer data path element may comprise at least two demux microcells, each performing a demultiplexing operation.

[0017] In an embodiment, the data path element may be configured as a comparator data path element to perform the atomic operation of compare on the at least two atomic data. In an embodiment, the comparator data path element may comprise at least two compare microcells, each configured to perform a compare operation, and one mask microcell may be configured to perform a mask operation.

[0018] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

[0019] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, explain the disclosed principles.

[0020] FIG. 1 illustrates a microcell of a microcell library, in accordance with an embodiment of the present disclosure.

[0021] FIG. 2 illustrates an adder data path element, in accordance with an exemplary embodiment.

[0022] FIG. 3 illustrates a subtractor data path element, in accordance with an exemplary embodiment.

[0023] FIG. 4 illustrates a multiplier data path element, in accordance with an exemplary embodiment.

[0024] FIG. 5 illustrates a bitwise logic data path element, in accordance with an embodiment of the present disclosure.

[0025] FIG. 6 illustrates a bi-directional shifter data path element, in accordance with an embodiment of the present disclosure.

[0026] FIG. 7 illustrates a multiplexer data path element, in accordance with an embodiment of the present disclosure.

[0027] FIG. 8 illustrates a demultiplexer data path element, in accordance with an embodiment of the present disclosure.

[0028] FIG. 9 illustrates a comparator data path element, in accordance with an embodiment of the present disclosure.

[0029] FIG. 10A, FIG. 10B and FIG. 10C illustrate histograms of the area-throughput figure-of-merit (FOM) for the unpipelined, pipelined, and subscalar implementations of the chosen benchmark circuits at pair, nibble, and byte valencies, for 8-bit, 16-bit, and 32-bit data-path widths, in accordance with an experimental embodiment of the present disclosure.

DETAILED DESCRIPTION

[0030] The present invention enables various combinatorial topologies using a cell library at an abstraction that is higher than that of standard cells, but lower than that of macro cells.

[0031] Exemplary embodiments are described with reference to the accompanying drawings. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims. Additional illustrative embodiments are listed below.

[0032] A subscalar operation operates on sub-atomic data fragments and performs only partial operations on them. The atomic data and atomic operations are broken down into sub-atomic data fragments and sub-atomic partial operations respectively. Such a break-up exposes hitherto unexploited levels of parallelism by allowing operations to overlap even if they are data dependent. Applicants have found that the improved exploitation of latent parallelism to enhance processing throughput comes with a favourable impact on the area-power characteristics of the corresponding computing structures. A number of these subscalar instructions are connected in series, parallel, cascade or compound topologies to perform an atomic operation on atomic data. As these subscalar operations or instructions operate on sub-atomic data fragments, even data-flow dependent operations may be temporally overlapped, thereby giving rise to a novel concept of sub-instruction level parallelism (sILP). Such atomic operations may be performed using a combination of microcells connected in one or more topologies such as, but not limited to, series, parallel, cascade or compound topologies.

[0033] In a pipelined subatomic implementation, using a single instance of the implementation unit, the atomic operations may be computed by splitting each of the atomic operations into one or more sub-atomic operations. In an embodiment, an atomic operation may be split into sub-atomic operations based on a complexity level of the atomic operation and time required to perform each of the subatomic operations. Accordingly, one or more of the subatomic operations may be computed in one clock cycle and any atomic operation may be implemented using a combination of microcells connected in one or more topologies. Further, the computation of each atomic operation may be done by operating on their corresponding atomic input operands or atomic datum which may be split into two or more sub-word or sub-atomic data fragments. In an embodiment, the valency of the subatomic data fragment may be equal to 1 bit, 2 bit, 4 bit, 8 bit, 16 bit, 32 bit, and so on.
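As a rough illustration of the splitting described above, the following Python sketch breaks an atomic datum into sub-atomic fragments of a chosen valency and reassembles it. The function names `split_fragments` and `join_fragments`, and the least-significant-fragment-first ordering, are assumptions made for illustration only and are not taken from the disclosure.

```python
def split_fragments(value, width, valency):
    """Split a width-bit atomic datum into fragments of `valency` bits each.

    Fragments are returned least-significant first (an assumed ordering).
    """
    if width % valency != 0:
        raise ValueError("width must be a multiple of valency")
    mask = (1 << valency) - 1
    return [(value >> shift) & mask for shift in range(0, width, valency)]


def join_fragments(fragments, valency):
    """Reassemble fragments (least-significant first) into one integer."""
    value = 0
    for index, fragment in enumerate(fragments):
        value |= fragment << (index * valency)
    return value


# An 8-bit datum at pair (2-bit) valency yields four fragments.
pairs = split_fragments(0xB7, width=8, valency=2)
print(pairs)                           # [3, 1, 3, 2]
print(hex(join_fragments(pairs, 2)))   # 0xb7
```

The same datum at nibble valency would yield two fragments, so the valency fixes how many sub-atomic operations one atomic operation is split into.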

[0034] In an embodiment, the microcells may be defined as pre-configured logic primitives that may be connected in various combinations and topologies to create useful arithmetic and logic modules like adders, shifters, multiplexers, comparators, etc., from which complete data-paths can be synthesized.

[0035] FIG. 1 illustrates a microcell of a microcell library, in accordance with an embodiment of the present disclosure. In an embodiment, a typical microcell 100 may implement a primitive logic and may have a uniform interface with three input operands a, b and c at an input interface and two output operands x and y at an output interface. In an embodiment, one or more of the three input operands and the two output operands may be enabled based on electric connection. Further, the two output operands x and y are latched to output registers 102 and 104 in a clock cycle.

[0036] In an embodiment, a library including a plurality of microcells 100 may be defined and is hereinafter referred to as a microcell library. Each microcell 100 of the microcell library may be defined to implement partial arithmetic, bitwise-logic, shift or control-flow operations based on one or more of the three input operands a, b and c to generate the two output operands x and y.

[0037] In an embodiment, the three input operands a, b and c and each of the two output operands x and y may have a uniform pre-defined valency. In an embodiment, the uniform pre-defined valency may be selected from a pair (2-bits), a nibble (4-bits), a byte (8-bits), a half-word (16-bits), or an integer power of 2.
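The uniform microcell interface can be pictured with a minimal behavioural sketch in Python. The class name `Microcell`, its `clock` method, and the example primitive `nibble_padd` are hypothetical; the sketch only assumes what is stated above, namely three inputs a, b and c, two outputs x and y latched to registers each clock cycle, and one valency shared by all ports.

```python
class Microcell:
    """Hypothetical model: three inputs (a, b, c), two registered outputs (x, y)."""

    def __init__(self, func, valency):
        self.func = func                  # maps (a, b, c) -> (x, y)
        self.mask = (1 << valency) - 1    # every port has the same valency
        self.x = 0                        # output register for x
        self.y = 0                        # output register for y

    def clock(self, a=0, b=0, c=0):
        """Advance one clock cycle: evaluate the primitive and latch x and y."""
        x, y = self.func(a & self.mask, b & self.mask, c & self.mask)
        self.x, self.y = x & self.mask, y & self.mask
        return self.x, self.y


def nibble_padd(a, b, c):
    """Example primitive (assumed): partial addition at nibble valency."""
    total = a + b + c
    return total & 0xF, total >> 4


cell = Microcell(nibble_padd, valency=4)
print(cell.clock(a=9, b=8, c=1))   # (2, 1): 9 + 8 + 1 = 18 = 1*16 + 2
```

Leaving a port at its default of 0 is used here as a stand-in for an operand that is not electrically connected.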

[0038] In an embodiment, the implementation of the logical operations or the partial arithmetic operations using one or more microcells from the library of microcells may be in a pipelined manner or an unpipelined manner.

[0039] In an embodiment, one or more of the plurality of microcells are combined for performing the partial arithmetic operation or the logical operation at a valency higher than the pre-defined valency.

[0040] In an embodiment, the microcell library may include, but is not limited to, a logic microcell, a shift microcell, a mask microcell, a partial adder (padd) microcell, a partial subtractor (psub) microcell, a partial multiplier (pmlt) microcell, a compare microcell, a mux microcell, a demux microcell, etc.

[0041] In an embodiment, a data path element may be implemented using a plurality of microcells from the microcell library. The plurality of microcells may be connected in one or more topologies to perform an atomic operation on at least two atomic input data of a pre-defined bit-size. In an embodiment, the atomic operation may be split into a plurality of sub-atomic operations and the at least two atomic input data may be split into a plurality of sub-atomic data fragments. In an embodiment, each of the plurality of microcells may perform a sub-atomic operation from the plurality of subatomic operations on at least two sub-atomic data fragments from the plurality of subatomic data fragments to generate at least one partial sub-atomic output data. In an embodiment, the at least one partial sub-atomic output data may be generated based on a partial arithmetic operation, a shift operation, or a logical operation performed by each of the plurality of microcells in each clock-cycle.

[0042] In an embodiment, various data path elements may be implemented as combinatorial topologies using two or more microcells from the microcell library. In an embodiment, various data path elements may comprise, but are not limited to, an adder data path element, a subtractor data path element, a multiplier data path element, a bitwise logic data path element, a bi-directional shifter data path element, a multiplexer data path element, a demultiplexer data path element, a comparator data path element, etc.

[0043] In an embodiment, a plurality of microcells may be connected in a serial topology by establishing at least one serial data path to output at least one partial subatomic output data from at least one preceding microcell of the plurality of microcells as input to at least one succeeding microcell of the plurality of microcells in a next clock-cycle.

[0044] In an embodiment, a plurality of microcells may be connected in a parallel topology by establishing at least two parallel data paths to output two partial sub-atomic output data each from two microcells from the plurality of microcells in parallel as input to at least two succeeding microcells of the plurality of microcells in the next clock-cycle.

[0045] In an embodiment, a plurality of microcells may be connected in a cascade topology by establishing at least two parallel data paths to output two partial subatomic output data each from at least one microcell from the plurality of microcells in parallel as input to at least one succeeding microcell of the plurality of microcells in a next clock-cycle.

[0046] FIG. 2 illustrates an adder data path element 200, in accordance with an exemplary embodiment. The adder data path element 200 may perform the atomic operation of an addition on at least two atomic data a and b. In an embodiment, the adder data path element 200 may include at least two padd microcells 202a-d, each performing a partial addition operation. In an embodiment, the atomic data a and b may be of valencies of 2 bits, 4 bits, 8 bits or any other integer power of two bits. Each of the atomic data a and b may be split into sub-atomic data of a minimum valency of 1 bit each. In an exemplary embodiment, the atomic data a and b may be split into sub-atomic data fragments a0-3 and b0-3, each specified as input operands to each of the padd or partial adder microcells 202a-d. The adder data path element 200 may have a latency of 4 clock cycles, but the throughput and the initial delay may be just 1 clock cycle each. This means that the adder data path element 200 is capable of producing results every clock cycle, fragment by fragment, while also consuming the operands every clock cycle, fragment by fragment. In contrast, in state-of-the-art pipelined parallel prefix designs, the throughput is 1 clock cycle but the latency and initial delay are both 5 clock cycles. Therefore, two data-dependent additions will necessarily take 10 clock cycles to complete, while in the case of subscalar designs they will complete in only 5 cycles. Moreover, the pipelined parallel prefix designs are composed of 14 logic blocks, while the subscalar designs are composed of only 4 logic blocks of comparable complexity, which may result in high throughput-area gains. Even though the architecture of the proposed subscalar adder data path element 200 resembles simple carry propagate designs, their use cases are entirely different. Conventional carry propagate adders wait until all the sum bits are produced, while the proposed subscalar designs move ahead to compute the next sum fragment as soon as any earlier fragment has finished its computation. During this time the next higher significant fragment of the earlier computation is also taken up. This ensures a much higher throughput. The longer the chain of such data-dependent additions, the higher the throughput gains.
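The cycle counts quoted above for two data-dependent additions can be reproduced with a small Python calculation. The timing model is an assumption made for illustration: a dependent operation is taken to start one initial delay after its predecessor started, and the last operation in the chain needs one full latency to finish. The function name `dependent_chain_cycles` is hypothetical.

```python
def dependent_chain_cycles(n_ops, latency, initial_delay):
    """Cycles to finish n_ops operations where each consumes the previous result.

    Assumed model: a dependent operation can start `initial_delay` cycles
    after its predecessor started, and the last one needs `latency` cycles.
    """
    return latency + (n_ops - 1) * initial_delay


# Figures quoted in the text for the adder data path element.
print(dependent_chain_cycles(2, latency=5, initial_delay=5))  # 10 (pipelined parallel prefix)
print(dependent_chain_cycles(2, latency=4, initial_delay=1))  # 5 (subscalar)
```

Under this model each further dependent addition costs the conventional design 5 more cycles but the subscalar design only 1, which is consistent with the statement that longer chains give higher gains.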

[0047] In an embodiment, each of the padd or partial adder microcells 202a-d may be configured to perform a partial addition operation to output the sum of the three input operands a0-3, b0-3 and c0-3. The two output operands may represent a sum output s0-3 and a carry-out y0-3 of the sum. In an embodiment, the three input ports of the three input operands a, b and c and the two output ports of the two output operands s and y are defined as:

c = carry in (1)
b = augend (2)
a = addend (3)
s = sum (4)
y = carry out (5)
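A minimal Python sketch of the port definitions (1) to (5), chained over four fragments, is given below. The function names `padd` and `subscalar_add` and the one-fragment-per-cycle schedule are illustrative assumptions; fragments are listed least-significant first.

```python
def padd(a, b, c, valency):
    """Partial add: returns (sum fragment s, carry out y) for a + b + c."""
    total = a + b + c
    return total & ((1 << valency) - 1), total >> valency


def subscalar_add(a_frags, b_frags, valency):
    """Add two operands fragment by fragment, least-significant first.

    Loop iteration i stands for the work done in clock cycle i + 1: the
    carry latched in one cycle feeds the next padd cell in the next cycle.
    """
    carry = 0
    schedule = []
    for cycle, (a, b) in enumerate(zip(a_frags, b_frags), start=1):
        s, carry = padd(a, b, carry, valency)
        schedule.append((cycle, s))
    return schedule, carry


# 0xB7 + 0x5A at pair valency; fragments are listed least-significant first.
schedule, carry_out = subscalar_add([3, 1, 3, 2], [2, 2, 1, 1], valency=2)
print(schedule)   # [(1, 1), (2, 0), (3, 1), (4, 0)] -> 0x11
print(carry_out)  # 1, so the full result is 0x111
```

Because the least-significant sum fragment is available after the first cycle, a dependent operation could begin consuming it while the higher fragments are still being computed.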

[0048] FIG. 3 illustrates a subtractor data path element 300, in accordance with an exemplary embodiment. The subtractor data path element 300 may perform the atomic operation of a subtraction on the at least two atomic data a and b. In an embodiment, the subtractor data path element 300 may include at least two microcells 302a-d, each performing a partial subtraction operation. In an embodiment, the atomic data a and b may be of valencies of 2 bits or more. Each of the atomic data a and b may be split into sub-atomic data of a minimum valency of 1 bit each. In an exemplary embodiment, the atomic data a and b may be split into sub-atomic data fragments a0-3 and b0-3, each input as input operands to each of the psub or partial subtractor microcells 302a-d. The subtractor data path element 300 may have a latency of 4 clock cycles, but the throughput and the initial delay may be just 1 clock cycle each. This means that the subtractor data path element 300 is capable of producing results every clock cycle, fragment by fragment, while also consuming the operands every clock cycle, fragment by fragment. In contrast, in state-of-the-art pipelined parallel prefix designs, the throughput is 1 clock cycle but the latency and initial delay are both 5 clock cycles. Therefore, two data-dependent subtractions will necessarily take 10 clock cycles to complete, while in the case of subscalar designs they will complete in only 5 cycles. Moreover, the pipelined parallel prefix designs are composed of 14 logic blocks, while the subscalar designs are composed of only 4 logic blocks of comparable complexity, which may result in high throughput-area gains.

[0049] Even though the architecture of the proposed subscalar subtractor resembles simple borrow propagate designs, their use cases are entirely different. Conventional borrow propagate subtractors wait until all the difference bits are produced, while the proposed subscalar designs move ahead to compute the next borrow fragment as soon as any earlier fragment has finished its computation. During this time the next higher significant fragment of the earlier computation is also taken up. This ensures a much higher throughput. The longer the chain of such data-dependent subtractions, the higher the throughput gains.

[0050] In an embodiment, each of the psub or partial subtractor microcells 302a-d may be configured to perform a partial subtraction operation to output the difference by subtracting a second input operand b0-3 and a third input operand c0-3 from a first input operand a0-3, wherein a first output operand d0-3 may represent the difference output in 2's complement form, and the second output operand y0-3 may represent the borrow out of the difference output. In an embodiment, the three input ports of the three input operands a, b and c and the two output ports of the two output operands d and y are connected as:

c = borrow in (6)
b = subtrahend (7)
a = minuend (8)
d = difference (9)
y = borrow out (10)
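The port definitions (6) to (10) may be sketched in Python in the same manner. The function names `psub` and `subscalar_sub` are hypothetical, and fragments are again assumed to be listed least-significant first.

```python
def psub(a, b, c, valency):
    """Partial subtract: returns (difference fragment d, borrow out y) for a - b - c."""
    diff = a - b - c
    borrow_out = 1 if diff < 0 else 0
    # Masking a negative Python int yields the 2's complement bit pattern.
    return diff & ((1 << valency) - 1), borrow_out


def subscalar_sub(a_frags, b_frags, valency):
    """Subtract fragment by fragment, passing the borrow to the next cell."""
    borrow = 0
    diffs = []
    for a, b in zip(a_frags, b_frags):
        d, borrow = psub(a, b, borrow, valency)
        diffs.append(d)
    return diffs, borrow


# 0x5A - 0xB7 at pair valency; fragments are listed least-significant first.
diffs, borrow_out = subscalar_sub([2, 2, 1, 1], [3, 1, 3, 2], valency=2)
print(diffs)       # [3, 0, 2, 2] -> 0xA3, the 8-bit 2's complement of -93
print(borrow_out)  # 1
```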

[0051] FIG. 4 illustrates a multiplier data path element 400, in accordance with an exemplary embodiment. The multiplier data path element 400 is configured to perform the atomic operation of a multiplication on the at least two atomic data a and b. In an embodiment, the multiplier data path element 400 may comprise at least four microcells including partial multiplier microcells 402a-j and partial adder microcells 404a-c. The partial multiplier microcells 402a-j may perform a partial multiply-add operation and the partial adder microcells 404a-c may perform a partial addition operation. The subscalar multiplier architecture of the multiplier data path element 400 is presented in FIG. 4. The design is a little irregular in a couple of the least significant blocks but is capable of achieving a latency of 6 cycles, a throughput of 1 cycle, and an initial delay of 3 cycles using 13 blocks, all of which are better figures than those of any conventional multiplier design. The solid rectangles 406 in the architecture presented in FIG. 4 are synchronizing registers, and the microcell sink indicates that the respective port is left unconnected. The data-path element 400 may be commonly used in signal processing applications and can be synthesized with a cascade connection of partial multiplier microcells 402a-j and partial adder microcells 404a-c. With subscalar designs, successive samples can be processed every 3 cycles, whereas conventional pipelined designs process them at a rate of 8 cycles per sample.

[0052] In an embodiment, each of the partial multiply-add microcells 402a-j may be configured to perform a partial multiply-add operation to output a product of a first input operand a0-3 and a second input operand b0-3 added to a third input operand c0-3. In an embodiment, a first output operand p0-3 may represent the lower half significant bits of the output, and a second output operand p4-7 may represent the upper half significant bits of the output. In an embodiment, the three input ports of the three input operands a0-3, b0-3 and c0-3 and the two output ports of the two output operands p0-3 and p4-7 are connected as:

c = augend (11)
b = multiplier (12)
a = multiplicand (13)
p0-3 = product low (14)
p4-7 = product high (15)
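As an illustration of the partial multiply-add operation of paragraph [0052], the following Python sketch assumes 4-bit input fragments. The function name partial_multiply_add and the width parameter are hypothetical. Because the largest possible value, 15 x 15 + 15 = 240, fits in 8 bits, the two 4-bit outputs together hold the complete result.

```python
def partial_multiply_add(a, b, c, width=4):
    """Hypothetical model of one partial multiply-add microcell.

    a = multiplicand, b = multiplier, c = augend (each `width` bits).
    Returns (p_low, p_high): the lower and upper halves of a * b + c.
    """
    mask = (1 << width) - 1
    for name, value in (("a", a), ("b", b), ("c", c)):
        if not 0 <= value <= mask:
            raise ValueError(f"{name} does not fit in {width} bits")
    full = a * b + c                    # at most 2 * width bits wide
    p_low = full & mask                 # lower half significant bits
    p_high = (full >> width) & mask     # upper half significant bits
    return p_low, p_high
```

For example, partial_multiply_add(3, 5, 2) splits the value 17 into a low half of 1 and a high half of 1.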

[0053] FIG. 5 illustrates a bitwise logic data path element 500, in accordance with an embodiment of the present disclosure. The bitwise logic data path element 500 may perform the atomic operation of a bitwise logic on the at least two atomic data a and b. In an embodiment, the bitwise logic data path element 500 may comprise at least two logic microcells 502a-d, each performing a logic operation. In an embodiment, the logic operation may be a bit-wise operation or a binary operation. The bitwise logic data path element 500 may implement a bit-wise logic operation in the subscalar computational methodology and can be implemented as a cascade connection of logic microcells 502a-d, as shown in FIG. 5. The bit-wise logic operation to be computed is selected by the third input c0-3 to the logic microcells 502a-d, which flows through the design cycle by cycle to ensure that the inputs to and the outputs from this data-path element follow the subscalar wave shape (staircase). The element is composed of 4 logic microcells 502a-d and may achieve a throughput and an initial delay of 1 cycle and a latency of 4 cycles.

[0054] In an embodiment, each of the logic microcells 502a-d of the microcell library may be configured to perform a bit-wise logic operation. In an embodiment, each of the logic microcells 502a-d may be configured to perform, based on a third input operand c0-3, either the bit-wise logic operation NOT on a first input operand a0-3 or a binary logic operation such as AND, OR or Ex-OR on the first input operand a0-3 and a second input operand b0-3. The result of the bit-wise logic operation is output as a first output operand z0-3 of the two output operands z and y, and a copy of the third input operand c0-3 is output as a second output operand y0-3 in the next clock cycle, latched to a register. In an embodiment, the output z0-3 is also latched to a register.

[0055] In an embodiment, each of the logic microcells 502a-d may perform a bit-wise logic operation such as, but not limited to, a complement operation, AND operation, OR operation, EX-OR operation, etc.

[0056] In an embodiment, each of the logic microcells 502a-d may perform a complement operation by complementing the first input operand a0-3, in case the third input operand c0-3 has the value 0. Further, the logic microcell 502a-d may perform an AND operation of the first input operand a0-3 and a second input operand b0-3, in case the third input operand c0-3 has the value 1. Further, the logic microcell 502a-d may perform an OR operation of the first input operand a0-3 and a second input operand b0-3, in case the third input operand c0-3 has the value 2. Further, the logic microcell 502a-d may perform an EX-OR operation of the first input operand a0-3 and a second input operand b0-3, in case the third input operand c0-3 has the value 3.

[0057] In an embodiment, the three input operands a0-3, b0-3 and c0-3 of the logic microcell 502a-d may be represented by the equations:

a = data_in_1 (16)
b = data_in_0 (17)
c = func (18)

[0058] In an exemplary embodiment, data_in_0 represents the 2-bit input as first input operand, data_in_1 represents the 2-bit input as second input operand, and func represents the 2-bit input as third input operand, depicting the type of bit-wise operation to be performed on the input operands a0-3 and b0-3.

[0059] In an embodiment, the logic microcell may implement the bit-wise logical functions NOT, AND, OR and Ex-OR of the input pairs connected to a0-3 and b0-3 depending upon whether c0-3 is 00, 01, 10 or 11, respectively.

[0060] In an embodiment, the two output operands z0-3 and y of the logic microcell 502a-d may be connected as represented by the equations:

z = data_out_1 (19)
y = func (20)

[0061] The computation of z for a 2-bit logic microcell is based on the following logic:

if (func = 0), z = ~data_in_0 (21)
else if (func = 1), z = data_in_1 & data_in_0 (22)
else if (func = 2), z = data_in_1 | data_in_0 (23)
else if (func = 3), z = data_in_1 ^ data_in_0 (24)
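The selection logic of paragraph [0061] can be sketched in Python as shown below. This is an illustrative model only: the function name logic_microcell and the width parameter are hypothetical, the complement for func = 0 is assumed to apply to data_in_0, and the register latching of both outputs is not modeled.

```python
def logic_microcell(data_in_1, data_in_0, func, width=2):
    """Hypothetical model of one 2-bit logic microcell.

    Returns (z, y): z is the bit-wise result selected by func and
    y is a copy of func that is passed on to the next microcell.
    """
    mask = (1 << width) - 1
    if func == 0:
        z = ~data_in_0 & mask           # NOT, kept to `width` bits
    elif func == 1:
        z = data_in_1 & data_in_0       # AND
    elif func == 2:
        z = data_in_1 | data_in_0       # OR
    elif func == 3:
        z = data_in_1 ^ data_in_0       # Ex-OR
    else:
        raise ValueError("func must be 0, 1, 2 or 3")
    return z & mask, func
```

Passing func through as the second output mirrors the way the operation code flows from one microcell to the next in the cascade.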

[0062] FIG. 6 illustrates a bi-directional shifter data path element 600, in accordance with an embodiment of the present disclosure. The bi-directional shifter data path element 600 may perform the atomic operation of a bi-directional shift of the at least two sub-atomic data. In an embodiment, the bi-directional shifter may comprise at least one mask microcell 602 performing a mask operation, at least two shift microcells 604a-d each performing a bit-wise left shift operation, and at least one OR microcell 606a-c performing an OR operation.

[0063] In an embodiment, one of the inputs of the mask microcell 602, shamt, may provide the number of bits to be shifted in the left or the right direction. In an embodiment, if the value of shamt is positive, the bi-directional shifter achieves a left shift, while if it is negative and specified in 2’s complement form, it achieves a right shift.

[0064] The shift microcells 604a-d of the microcell library may be configured to perform a left bitwise shift operation on a second input operand in0-3 by the number of bit positions specified by the third input operand shamt from the three input operands. In an embodiment, one input operand is disconnected or fed as 0 to the mask microcell 602 and the shift microcells 604a-d.

[0065] The output x0-3 represents the lower half bits of the shifted input and the output y0-3 represents the upper half bits of the shifted input. In an embodiment, the lower half bits are padded with zeroes in the right bit positions and the upper half bits are padded with zeroes in the left bit positions based on the shamt value.

[0066] The outputs from adjacent shift microcells 604a-d are then logically ORed using OR microcells 606a-c, each performing a logical OR operation on the outputs of consecutive shift microcells 604a-d, and the relevant shifted outputs are finally selected using mux microcells 608a-d, each performing a multiplexing operation by selecting one of its two inputs as its output. This novel scheme works well even for higher valency designs. The only constraint is that the shift amount cannot be more than the valency of implementation. This, however, is not a constraint in hardware realizations, as any remaining shifting can be trivially achieved by rewiring or by barrel shifters 604a-d. The shifter achieves a throughput of 1 cycle, a latency of 7 cycles, and an initial delay of 4 cycles.
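The left-shift path of paragraphs [0064] to [0066] can be illustrated with the Python sketch below. It assumes 4-bit fragments, four shift microcells, and a shift amount between 0 and the fragment width. The names shift_microcell and left_shift are hypothetical, and the mask microcell, the mux microcells, right shifts and the cycle timing are deliberately left out.

```python
def shift_microcell(frag, shamt, width=4):
    """Hypothetical shift microcell: left-shift one fragment.

    Returns (x, y): the lower half and the upper half of the
    shifted fragment, each zero-padded to `width` bits.
    """
    mask = (1 << width) - 1
    wide = frag << shamt
    return wide & mask, (wide >> width) & mask


def left_shift(value, shamt, width=4, cells=4):
    """Combine adjacent microcell outputs with OR, as in [0066]."""
    if not 0 <= shamt <= width:
        raise ValueError("shift amount must not exceed the valency")
    mask = (1 << width) - 1
    result = 0
    spill = 0  # upper half produced by the less significant neighbour
    for i in range(cells):
        frag = (value >> (i * width)) & mask
        x, y = shift_microcell(frag, shamt, width)
        result |= (x | spill) << (i * width)   # OR microcell
        spill = y
    return result
```

Under these assumptions, left_shift(0x00FF, 3) produces 0x07F8, which matches a plain 16-bit left shift by three positions.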

[0067] FIG. 7 illustrates a multiplexer data path element 700, in accordance with an embodiment of the present disclosure. The multiplexer data path element 700 may perform the atomic operation of selecting one of the inputs from the at least two atomic data. In an embodiment, the multiplexer data path element 700 may comprise at least two mux microcells 702a-d implemented in cascade topology, each performing a multiplexing operation. All of the mux microcells 702a-d may achieve throughput and initial delay of 1 cycle and latency of 5, 4, and 4 cycles respectively.

[0068] Each of the mux microcells 702a-d of the microcell library may be configured to perform a multiplexing operation to select a first output operand x0-3 equal to a first input operand a0-3 from the three input operands a0-3, b0-3 and c0-3 in case the third input operand c0-3 is equal to “0”. In an embodiment, each of the mux microcells 702a-d may be configured to perform a multiplexing operation to select the first output operand x0-3 equal to a second input operand b0-3 from the three input operands a0-3, b0-3 and c0-3 in case the third input operand c0-3 is equal to “3”. In an embodiment, the mux microcell 702a-d may be configured to replicate the third input operand c0-3 from the three input operands a, b and c at the second output operand y0-3. In an embodiment, the three input ports of the three input operands a, b and c and the two output ports x and y are connected as:

c = select (30)
b = data_in_0 (31)
a = data_in_1 (32)
x = data_out (33)
y = select (34)

[0069] An exemplary implementation of a 2-bit mux microcell 702a-d may be represented by the following logic:

if (c = 0), x = b (35)
else if (c = 3), x = a (36)
y = c (37)
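A minimal Python sketch of the 2-bit mux microcell logic is given below, written against equations (35) to (37). The function name mux_microcell is hypothetical, and rejecting select values other than 0 and 3 is an assumption, since the behaviour for those values is not stated.

```python
def mux_microcell(a, b, c):
    """Hypothetical model of one 2-bit mux microcell.

    a = data_in_1, b = data_in_0, c = select.
    Returns (x, y): the selected data and a copy of the select input.
    """
    if c == 0:
        x = b           # equation (35)
    elif c == 3:
        x = a           # equation (36)
    else:
        # Assumption: other select values are treated as invalid here.
        raise ValueError("select must be 0 or 3")
    y = c               # equation (37): select is replicated at y
    return x, y
```

Replicating the select input at y lets the same select value travel along a cascade of mux microcells, one fragment per cycle.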

[0070] FIG. 8 illustrates a demultiplexer data path element 800, in accordance with an embodiment of the present disclosure. The demultiplexer data path element 800 may perform the atomic operation of steering the input a0-3 to one of two output paths x0-3 and y0-3. In an embodiment, the demultiplexer data path element 800 may comprise at least two demux microcells 802a-d implemented in cascade topology and each performing a demultiplexing operation.

[0071] In an embodiment, each of the demux microcells 802a-d of the microcell library may be configured to perform a demultiplexing operation to output a first output operand x0-3 equal to the input operand a0-3 in case the third input operand c0-3 is equal to “0”. In an embodiment, each of the demux microcells 802a-d may be configured to output the second output operand y0-3 equal to the input operand a0-3 in case the third input operand c0-3 is equal to “3”. In an embodiment, the second input operand b is left unconnected. In an embodiment, the input ports of the two connected input operands a and c and the two output ports x and y are connected as:

c = select (38)
a = data_in (39)
x = data_out_0 (40)
y = data_out_1 (41)

[0072] In an exemplary implementation of a 2-bit demux microcell 802a-d, the functional semantics may be represented by the following logic:

if (c = 0), x = a (42)
else if (c = 3), y = a (43)
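The demux logic of equations (42) and (43) can be sketched in Python as follows. The function name demux_microcell is hypothetical. Driving the unselected output to 0 and rejecting select values other than 0 and 3 are both assumptions, because neither case is specified.

```python
def demux_microcell(a, c):
    """Hypothetical model of one 2-bit demux microcell.

    a = data_in, c = select; input b is left unconnected.
    Returns (x, y) = (data_out_0, data_out_1).
    """
    if c == 0:
        return a, 0     # equation (42); y = 0 is an assumption
    if c == 3:
        return 0, a     # equation (43); x = 0 is an assumption
    # Assumption: other select values are treated as invalid here.
    raise ValueError("select must be 0 or 3")
```

A real microcell could equally hold the previous value on the unselected output; the zero used here only keeps the sketch deterministic.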

[0073] FIG. 9 illustrates a comparator data path element 900, in accordance with an embodiment of the present disclosure. The comparator data path element 900 may perform the atomic operation of compare on the at least two atomic data a and b. In an embodiment, the comparator data path element 900 may comprise at least two compare microcells 902a-d connected to each other in cascade topology and each configured to perform a compare operation. Further, the comparator data path element 900 may include one mask microcell 904 which may be configured to perform a mask operation. The output of the last of the compare microcells 902a-d connected in cascade topology is masked as per the input func to obtain the final output relation of the comparator data path element 900.

[0074] In an embodiment, the connection shown in FIG. 9 of the compare microcells 902a-d and mask microcell 904 may perform a compare operation to output as “0” in case a first input operand a is equal to a second input operand b, or output as “1” in case a first input operand a is greater than a second input operand b, or output as “2” in case a first input operand a is less than a second input operand b.

[0075] In an embodiment, each of the compare microcells 902a-d may be configured to output the second output operand y0-3 as “1” in case the first input operand a0-3 is greater than the second input operand b0-3 or in case the first input operand a0-3 is equal to the second input operand b0-3 and the third input operand c0-3 is equal to “1”. In an embodiment, the compare microcell 902a-d may be configured to output the second output operand y0-3 as “2” in case the first input operand a0-3 is less than the second input operand b0-3 or in case the first input operand a0-3 is equal to the second input operand b0-3 and the third input operand c0-3 is equal to “2”. In an embodiment, the three input ports of the three input operands a, b and c and one output port y are connected as:

c = compare_so_far (44)
b = data_in_0 (45)
a = data_in_1 (46)
y = comparison out (47)

[0076] In an embodiment, according to an exemplary implementation of a 2-bit compare microcell 902a-d, the computation of y0-3 for 2-bit inputs a0-3, b0-3 and c0-3 is based on the functional semantics represented by the following logic:

if (b = a & c = 0), y = 0 (48)
else if (b > a | (b = a & c = 1)), y = 1 (49)
else if (b < a | (b = a & c = 2)), y = 2 (50)
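The following Python sketch transcribes the logic of (48) to (50) literally in terms of the port names a, b and c, and chains four 2-bit microcells. It is an illustration under stated assumptions: the names compare_microcell and compare_cascade are hypothetical, the cascade is assumed to run from the least significant fragment to the most significant one with an initial compare_so_far of 0, and the mask microcell 904 is not modeled.

```python
def compare_microcell(a, b, c):
    """Hypothetical 2-bit compare microcell, following (48) to (50).

    a = data_in_1, b = data_in_0, c = compare_so_far.
    """
    if b == a and c == 0:
        return 0                        # (48)
    if b > a or (b == a and c == 1):
        return 1                        # (49)
    if b < a or (b == a and c == 2):
        return 2                        # (50)
    raise ValueError("compare_so_far must be 0, 1 or 2")


def compare_cascade(a, b, width=2, cells=4):
    """Chain microcells, least significant fragment first (assumed)."""
    mask = (1 << width) - 1
    so_far = 0
    for i in range(cells):
        frag_a = (a >> (i * width)) & mask
        frag_b = (b >> (i * width)) & mask
        # A more significant fragment overrides the result so far
        # unless the two fragments are equal.
        so_far = compare_microcell(frag_a, frag_b, so_far)
    return so_far
```

Because each microcell only keeps the earlier result when its own fragments are equal, the most significant differing fragment decides the final relation.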

[0077] In an embodiment, each of the two outputs of the microcells of the microcell library may be latched to registers which may be synchronized with respect to clock cycle.

[0078] Further, the data path elements 200, 300, 400, 500, 600, 700, 800 and 900 may be implemented for valency of, but not limited to, 2-bits, 4-bits, a byte, a half-word (16-bits), a word (32 bits) and so on to create combinatorial topologies comprising microcells 100 for various valencies.

[0079] It is worth mentioning that the latencies of data-path elements lose their significance in subscalar designs and are thus irrelevant when the elements are used in larger algorithmic data-paths. The throughput of each of the subscalar data-path elements 200, 300, 400, 500, 600, 700, 800 and 900 described above is 1 cycle, which helps in achieving an overall throughput of 1 cycle per instruction for the entire data-path irrespective of whether the instructions have any data dependence. The initial delay is small and gets amortized over long runs of instructions. In the case of dependent operations and in the case of loops, a higher initial delay has a detrimental effect. This does not impact the overall gains much, as is evident in the results presented in FIG. 10A, FIG. 10B, and FIG. 10C. All the circuits, however, consume much less area and consequently much less power as compared to their state-of-the-art high-speed implementations. For superpipelined instances, the speedups are negligibly small and even negative in some cases. This is due to the data dependencies exhibited by the benchmark programs evaluated. Forwarding (bypassing) resolves the true data dependencies in the standard baseline 5-stage in-order-issue processor considered in this disclosure. However, in its superpipelined version, the data-dependent instruction has to suffer stalls until the instruction on which it depends produces its result. This stall can be prohibitively large. At the same time, the silicon area needed for realization of superpipelined versions is higher for a couple of reasons. First, many more pipelining registers are needed. In addition, the hazard detection and resolution hardware is quadratically more complex. The area-throughput figure-of-merit is thus even worse. The subscalar versions do not suffer from this bottleneck. The next data-dependent instruction can initiate its execution in the very next cycle, even if the earlier instruction has not produced its complete result. This is possible because the earlier instruction has at least produced the least significant data fragment of the result, which is consumed by the least significant data fragment of the later instruction. This results in phenomenal speedups. Moreover, the hazard detection and resolution hardware needed in this case is exactly the same as that of the baseline processor. At the same time, the implementation logic for all the data-path elements is much less silicon intensive. This is reflected in a much larger area-throughput figure-of-merit.
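The overlapped execution of data-dependent instructions described in paragraph [0079] can be illustrated with a simple timing model in Python. The model is an assumption made for illustration and is not taken from the disclosure: every fragment operation takes one cycle, fragment f of an instruction needs only fragment f of the preceding instruction and fragment f-1 of its own instruction, and initial delays are ignored. The names staircase_schedule, overlapped_cycles and serial_cycles are hypothetical.

```python
def staircase_schedule(num_ops, num_fragments):
    """Cycle in which fragment f of dependent instruction j executes.

    Assumed model: instruction j starts one cycle after instruction
    j - 1, and its fragments follow one cycle apart (staircase).
    """
    return [[j + f for f in range(num_fragments)] for j in range(num_ops)]


def overlapped_cycles(num_ops, num_fragments):
    """Total cycles when dependent instructions overlap (assumed model)."""
    schedule = staircase_schedule(num_ops, num_fragments)
    return schedule[-1][-1] + 1


def serial_cycles(num_ops, num_fragments):
    """Total cycles when each instruction waits for the full result."""
    return num_ops * num_fragments
```

Under this model a chain of 10 dependent instructions on 4 fragments completes in 13 cycles when overlapped, against 40 cycles when each instruction waits for the complete result of its predecessor.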

[0080] The area estimates may be reduced by a factor of almost two and a half when implemented using the subscalar computing methodology, as disclosed in detail in the concurrently filed Indian patent applications titled “System and Method For Implementation of Computational Logic Using Digital VLSI Systems” and “Microcell Library For Implementation Of Computational Logic Using Digital VLSI Systems” and the IEEE paper titled “Novel VLSI Architectures and Micro-Cell Libraries for Subscalar Computations”, each incorporated herein in its entirety by reference.

[0081] Accordingly, the implementation of computational logic using the subscalar methodology may preserve the data width while gainfully processing smaller fragments of full-width data to reduce the complexities in space, in time, or in both.

[0082] The subscalar computational logic implements an overlapped execution of a plurality of data-dependent or independent atomic operations. A subscalar computing unit (not shown) may perform various atomic operations, which may be based on one or more computational logics such as addition, subtraction, multiplication, shift, mux, de-mux, etc., implemented using the microcell library of the present disclosure to output a resultant data. The throughput achieved in the subscalar computation methodology is approximately five time units per iteration, which is fewer time units per iteration than with conventional computation methodologies, whose throughput, including latency, is up to nine time units per iteration.

[0083] It will be appreciated that, for clarity purposes, the above description has described embodiments of the invention with reference to different functional units and processors. However, it will be apparent that any suitable distribution of functionality between different functional units, processors or domains may be used without detracting from the invention. For example, functionality illustrated to be performed by separate logic primitives (or partial processing units) of various data-path elements in the context of the present disclosure may be implemented as, but not limited to, microcells, discrete components, reconfigurable look up tables or ROM cells, etc. Hence, references to specific functional units are only to be seen as references to suitable means for providing the described functionality, rather than indicative of a strict logical or physical structure or organization.

[0084] Although the present invention has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present invention may not be limited only by the claims. Additionally, although a feature may appear to be described in connection with embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in accordance with the invention.

[0085] Furthermore, although individually listed, a plurality of means, elements or process steps may be implemented by, for example, a single unit or processor. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Also, the inclusion of a feature in one category as claimed in claims does not imply a limitation to this category, but rather the feature may be equally applicable to other claim categories, as appropriate.
