Title:
PLANAR-STAGGERED ARRAY FOR DCNN ACCELERATORS
Document Type and Number:
WIPO Patent Application WO/2022/124993
Kind Code:
A1
Abstract:
A memory device for deep neural network, DNN, accelerators, a method of fabricating a memory device for deep neural network, DNN, accelerators, a method of convoluting a kernel [A] with an input feature map [B] in a memory device for a deep neural network, DNN, accelerator, a memory device for a deep neural network, DNN, accelerator, and a deep neural network, DNN, accelerator. The method of fabricating a memory device for deep neural network, DNN, accelerators comprises the steps of forming a first electrode layer comprising a plurality of bit-lines; forming a second electrode layer comprising a plurality of word-lines; and forming an array of memory elements disposed at respective cross-points between the plurality of word-lines and the plurality of bit-lines; wherein at least a portion of the bit-lines are staggered such that a location of a first cross-point between the bit-line and a first word-line is displaced along a direction of the word-lines compared to the cross-point between said bit-line and a second word-line adjacent the first word-line; or wherein at least a portion of the word-lines are staggered such that a location of a cross-point between the word-line and a first bit-line is displaced along a direction of the bit-lines compared to a cross-point between said word-line and a second bit-line adjacent the first bit-line.

Inventors:
VELURI HASITA (SG)
THEAN VOON YEW AARON (SG)
LI YIDA (SG)
TANG BAOSHAN (SG)
Application Number:
PCT/SG2021/050778
Publication Date:
June 16, 2022
Filing Date:
December 10, 2021
Assignee:
NAT UNIV SINGAPORE (SG)
International Classes:
G06N3/04; G06N3/067; G11C7/18; G11C8/14
Foreign References:
US 5864496 A (1999-01-26)
CN 106935258 A (2017-07-07)
US 2013/0339571 A1 (2013-12-19)
CN 111260048 A (2020-06-09)
US 2020/0175363 A1 (2020-06-04)
CN 111985602 A (2020-11-24)
Other References:
VELURI H. ET AL.: "A Low-Power DNN Accelerator Enabled by a Novel Staircase RRAM Array", IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 20 October 2021 (2021-10-20), pages 1-12, XP055952526, [retrieved on 2022-03-02], DOI: 10.1109/TNNLS.2021.3118451
LIN PENG, LI CAN, WANG ZHONGRUI, LI YUNNING, JIANG HAO, SONG WENHAO, RAO MINGYI, ZHUO YE, UPADHYAY NAVNIDHI K., BARNELL MARK, WU Q: "Three-dimensional memristor circuits as complex neural networks", NATURE ELECTRONICS, vol. 3, no. 4, 1 April 2020 (2020-04-01), pages 225 - 232, XP055952525, DOI: 10.1038/s41928-020-0397-9
Attorney, Agent or Firm:
VIERING, JENTSCHURA & PARTNER LLP (SG)
Claims:
CLAIMS

1. A memory device for deep neural network, DNN, accelerators, the memory device comprising: a first electrode layer comprising a plurality of bit-lines; a second electrode layer comprising a plurality of word-lines; and an array of memory elements disposed at respective cross-points between the plurality of word-lines and the plurality of bit-lines; wherein at least a portion of the bit-lines are staggered such that a location of a cross-point between the bit-line and a first word-line is displaced along a direction of the word-lines compared to a cross-point between said bit-line and a second word-line adjacent the first word-line; or wherein at least a portion of the word-lines are staggered such that a location of a cross-point between the word-line and a first bit-line is displaced along a direction of the bit-lines compared to a cross-point between said word-line and a second bit-line adjacent the first bit-line.

2. The memory device of claim 1, wherein at least a portion of the bit-lines are staggered and the array of memory elements comprises a plurality of array-structures, ASs, each AS comprising a set of adjacent word-lines, wherein each AS comprises a plurality of sub-arrays, wherein each sub-array is configured to take inputs from a row of an input matrix and to have the elements of a row of a kernel applied in the DNN accelerator contributing to the output.

3. The memory device of claims 1 or 2, configured to have a digital to analog converter, DAC, circuit coupled to the bit-lines for inference processing.

4. The memory device of claim 3, comprising a connection layer separate from the first and second electrode layers for connecting intermediate bit-line inputs disposed between adjacent ones of the word-lines to the DAC circuit for inference processing.

5. The memory device of any one of claims 1 to 4, configured to have an analog to digital converter and sense amplifier, ADC/SA, circuit coupled to the word-lines for inference processing.

6. The memory device of claim 1, wherein at least a portion of the word-lines are staggered and the array of memory elements comprises a plurality of array-structures, ASs, each AS comprising a set of adjacent bit-lines, wherein each AS comprises a plurality of sub-arrays, wherein each sub-array is configured to take inputs from a row of an input matrix and to have the elements of a row of a kernel applied in the DNN accelerator contributing to the output.

7. The memory device of claims 1 or 6, configured to have a digital to analog converter, DAC, circuit coupled to the word-lines for inference processing.


8. The memory device of claim 7, comprising a connection layer separate from the first and second electrode layers for connecting intermediate word-line inputs disposed between adjacent ones of the bit-lines to the DAC circuit for inference processing.

9. The memory device of any one of claims 1 and 6 to 8, configured to have an analog to digital converter and sense amplifier, ADC/SA, circuit coupled to the bit-lines for inference processing.

10. The memory device of any one of claims 1 to 9, wherein each memory element comprises a switching layer sandwiched between the bottom and top electrode layers.

11. The memory device of claim 10, wherein the switching layer comprises Al2O3, SiO2, HfO2, MoS2, TaOx, TiO2, ZrO2, ZnO, GeSbTe, Cu-GeSex, etc.

12. The memory device of claims 10 or 11, wherein at least one of the bottom and top electrode layers comprises an inert metal such as Platinum, Palladium, Gold, Silver, Copper, Tungsten etc.

13. The memory device of any one of claims 10 to 12, wherein at least one of the bottom and top electrode layers comprises a reactive metal such as Titanium, TiN, TaN, Tantalum etc.

14. A method of fabricating a memory device for deep neural network, DNN, accelerators, the method comprising the steps of: forming a first electrode layer comprising a plurality of bit-lines; forming a second electrode layer comprising a plurality of word-lines; and forming an array of memory elements disposed at respective cross-points between the plurality of word-lines and the plurality of bit-lines; wherein at least a portion of the bit-lines are staggered such that a location of a first cross-point between the bit-line and a first word-line is displaced along a direction of the word-lines compared to the cross-point between said bit-line and a second word-line adjacent the first word-line; or wherein at least a portion of the word-lines are staggered such that a location of a cross-point between the word-line and a first bit-line is displaced along a direction of the bit-lines compared to a cross-point between said word-line and a second bit-line adjacent the first bit-line.

15. The method of claim 14, wherein at least a portion of the bit-lines are staggered and the array of memory elements comprises a plurality of array-structures, ASs, each AS comprising a set of adjacent word-lines, wherein each AS comprises a plurality of sub-arrays, wherein each sub-array is configured to take inputs from a row of an input matrix and to have the elements of a row of a kernel applied in the DNN accelerator contributing to the output.

16. The method of claims 14 or 15, comprising configuring the memory device to have a digital to analog converter, DAC, circuit coupled to the bit-lines during inference processing.


17. The method of claim 16, comprising forming a connection layer separate from the first and second electrode layers for connecting intermediate bit-line inputs disposed between adjacent ones of the word-lines to the DAC circuit during inference processing.

18. The method of any one of claims 14 to 17, comprising configuring the memory device to have an analog to digital converter and sense amplifier, ADC/SA, circuit coupled to the word-lines during inference processing.

19. The method of claim 14, wherein at least a portion of the word-lines are staggered and the array of memory elements comprises a plurality of array-structures, ASs, each AS comprising a set of adjacent bit-lines, wherein each AS comprises a plurality of sub-arrays, wherein each sub-array is configured to take inputs from a row of an input matrix and to have the elements of a row of a kernel applied in the DNN accelerator contributing to the output.

20. The method of claims 14 or 19, comprising configuring the memory device to have a digital to analog converter, DAC, circuit coupled to the word-lines during inference processing.

21. The method of claim 20, comprising forming a connection layer separate from the first and second electrode layers for connecting intermediate word-line inputs disposed between adjacent ones of the bit-lines to the DAC circuit during inference processing.

22. The method of any one of claims 14 and 19 to 21, comprising configuring the memory device to have an analog to digital converter and sense amplifier, ADC/SA, circuit coupled to the bit-lines during inference processing.

23. The method of any one of claims 14 to 22, wherein each memory element comprises a switching layer sandwiched between the bottom and top electrode layers.

24. The method of claim 23, wherein the switching layer comprises Al2O3, SiO2, HfO2, MoS2, TaOx, TiO2, ZrO2, ZnO, GeSbTe, Cu-GeSex, etc.

25. The method of claims 23 or 24, wherein at least one of the bottom and top electrode layers comprises an inert metal such as Platinum, Palladium, Gold, Silver, Copper, Tungsten etc.

26. The method of any one of claims 23 to 25, wherein at least one of the bottom and top electrode layers comprises a reactive metal such as Titanium, TiN, TaN, Tantalum etc.

27. A method of convoluting a kernel [A] with an input feature map [B] in a memory device for a deep neural network, DNN, accelerator, comprising the steps of: transforming the kernel [A] into matrices [A1] and [U1]; transforming the feature map [B] into matrices [B1] and [U2]; splitting [A1] into matrices [M1] and [M2]; splitting [U1] into matrices [M3] and [M4] using abs(min([A])); performing a state transformation on [M1], [M2], [M3], and [M4] to generate memory device conductance state matrices to be used to program memory elements of the memory device; and using [B1] and [U2] to determine respective pulse width matrices to be applied to word-lines/bit-lines of the memory device.

28. The method of claim 27, wherein performing a state transformation on [M1], [M2], [M3], and [M4] to generate the memory device conductance state matrices is based on a selected quantization step of the DNN accelerator.

29. The method of claim 28, wherein using [B1] and [U2] to determine respective pulse width matrices is based on the selected quantization step of the DNN accelerator.

30. The method of claim 29, comprising splitting each of [M1] and [M2] using equations equivalent to those used for splitting [A1], and performing a state transformation on the resultant split matrices to generate additional memory device conductance state matrices to be used to program memory elements of the memory device, for increasing an accuracy of the DNN accelerator.

31. A memory device for a deep neural network, DNN, accelerator configured for executing the method of any one of claims 27 to 30.

32. A deep neural network, DNN, accelerator comprising a memory device of any one of claims 1 to 13, 31.

Description:
PLANAR-STAGGERED ARRAY FOR DCNN ACCELERATORS

FIELD OF INVENTION

The present invention relates broadly to a memory device for deep neural network, DNN, accelerators, a method of fabricating a memory device for deep neural network, DNN, accelerators, a method of convoluting a kernel [A] with an input feature map [B] in a memory device for a deep neural network, DNN, accelerator, a memory device for a deep neural network, DNN, accelerator, and a deep neural network, DNN, accelerator; specifically to the development of an architecture for efficient execution of convolution in deep convolutional neural networks (DCNNs).

BACKGROUND

Any mention and/or discussion of prior art throughout the specification should not be considered, in any way, as an admission that this prior art is well known or forms part of common general knowledge in the field.

The recent advances in low-power Deep Neural Network (DNN) accelerators provide a pathway to infuse connected devices with the communication and computational capabilities required to revolutionize our interactions with the physical world. As untethered computing using DNNs at the edge of the IoT is limited by the power source, the power-hungry high-performance servers required by GPU/ASIC-based DNNs act as a deterrent to their widespread deployment. This bottleneck motivates the investigation of more efficient but specialized devices and architectures.

Resistive Random-Access Memories (RRAMs) are memory devices capable of storing continuous non-volatile conductance states. By leveraging the RRAM crossbar's ability to perform parallel in-memory multiply-and-accumulate computations, one can build compact, high-speed DNN processors. However, convolution execution (Figure 1(a)) and simultaneous output feature map generation using planar crossbar arrays with the Manhattan layout (Figure 1(b)) require unfolding input matrices into vectors and massive input regeneration, both of which lead to increased power and area consumption.
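For context, the following Python sketch (illustrative only; array sizes and names are not from the specification) shows the conventional im2col-style mapping being criticized here: the input is unfolded into overlapping patches so that convolution becomes a single matrix product on a Manhattan-layout crossbar, duplicating each input element up to kernel-size times.

import numpy as np

# Unfold the input feature map into overlapping patches (im2col) so that
# convolution becomes one matrix product. Each input element is duplicated up
# to m*n times, which is the input-regeneration overhead discussed above.
def im2col(B, m, n):
    H, W = B.shape
    cols = [B[i:i + m, j:j + n].ravel()
            for i in range(H - m + 1) for j in range(W - n + 1)]
    return np.stack(cols, axis=1)          # shape: (m*n, number of outputs)

B = np.random.rand(28, 28)                 # input feature map
A = np.random.rand(3, 3)                   # kernel
outputs = A.ravel() @ im2col(B, 3, 3)      # all convolution outputs at once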

Current state-of-the-art RRAM array-based DNN accelerators overcome the above issues and enhance performance by combining the RRAM with multiple architectural optimizations. For example, one existing RRAM array-based DNN accelerator improves system throughput using an interlayer pipeline, but this can lead to pipeline bubbles and high latency. Another existing RRAM array-based DNN accelerator employs layer-by-layer output computation and parallel multi-image processing to eliminate dependencies, yet it increases the buffer sizes. Another existing RRAM array-based DNN accelerator increases input reuse by engaging register chains and buffer ladders in different layers, but increases the bandwidth burden. Using a multi-tiled architecture where each tile computes partial sums in a pipelined fashion also increases input reuse. Another existing RRAM array-based DNN accelerator employs bidirectional connections between processing elements to maximize input reuse while minimizing interconnect cost. Another existing RRAM array-based DNN accelerator maps multiple filters onto a single array and reorders inputs and outputs to generate outputs in parallel. Other existing RRAM array-based DNN accelerators exploit the third dimension to build 3D-arrays for performance enhancements.

However, the system-level enhancements that most reported works employ result in hardware complexities. The differential technique (Figure 1(b)) that they utilize for signed floating-point computations, and the usage of a 16-bit input resolution, impede significant throughput improvement and power reduction owing to increased clock cycles and interface accesses. Typical 3D-RRAM implementations using through-silicon vias (TSVs) face similar image unfolding and regeneration issues. Though 3D-arrays with staircase routing (Staggered-3D) improve throughput, they suffer from high via-resistance that limits the number of RRAM layers and increases peripheral circuitry. Besides, the intrinsic analog nature of computations within crossbar arrays renders them highly susceptible to the parasitic I-R drop and to the RRAM's current nonlinearity and limited conductance range. Thus, there is a need for layout optimizations and a hardware-aware in-memory compute methodology to overcome the mentioned weaknesses and circuit overheads.

Embodiments of the present invention seek to address at least one of the above needs.

SUMMARY

In accordance with a first aspect of the present invention, there is provided a memory device for deep neural network, DNN, accelerators, the memory device comprising: a first electrode layer comprising a plurality of bit-lines; a second electrode layer comprising a plurality of word-lines; and an array of memory elements disposed at respective cross-points between the plurality of word-lines and the plurality of bit-lines; wherein at least a portion of the bit-lines are staggered such that a location of a cross-point between the bit-line and a first word-line is displaced along a direction of the word-lines compared to a cross-point between said bit-line and a second word-line adjacent the first word-line; or wherein at least a portion of the word-lines are staggered such that a location of a cross-point between the word-line and a first bit-line is displaced along a direction of the bit-lines compared to a cross-point between said word-line and a second bit-line adjacent the first bit-line.

In accordance with a second aspect of the present invention, there is provided a method of fabricating a memory device for deep neural network, DNN, accelerators, the method comprising the steps of: forming a first electrode layer comprising a plurality of bit-lines; forming a second electrode layer comprising a plurality of word-lines; and forming an array of memory elements disposed at respective cross-points between the plurality of word-lines and the plurality of bit-lines; wherein at least a portion of the bit-lines are staggered such that a location of a first cross-point between the bit-line and a first word-line is displaced along a direction of the word-lines compared to the cross-point between said bit-line and a second word-line adjacent the first word-line; or wherein at least a portion of the word-lines are staggered such that a location of a cross-point between the word-line and a first bit-line is displaced along a direction of the bit-lines compared to a cross-point between said word-line and a second bit-line adjacent the first bit-line.

In accordance with a third aspect of the present invention, there is provided a method of convoluting a kernel [A] with an input feature map [B] in a memory device for a deep neural network, DNN, accelerator, comprising the steps of: transforming the kernel [A] into matrices [A1] and [U1]; transforming the feature map [B] into matrices [B1] and [U2]; splitting [A1] into matrices [M1] and [M2]; splitting [U1] into matrices [M3] and [M4]; performing a state transformation on [M1], [M2], [M3], and [M4] to generate memory device conductance state matrices to be used to program memory elements of the memory device; and using [B1] and [U2] to determine respective pulse width matrices to be applied to word-lines/bit-lines of the memory device.

In accordance with a fourth aspect of the present invention, there is provided a memory device for a deep neural network, DNN, accelerator configured for executing the method of the third aspect.

In accordance with a fifth aspect of the present invention, there is provided a deep neural network, DNN, accelerator comprising a memory device of the first or fourth aspects.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention will be better understood and readily apparent to one of ordinary skill in the art from the following written description, by way of example only, and in conjunction with the drawings, in which:

Figure 1(a) shows a schematic drawing illustrating operations involved in the convolution of a kernel with an input image.

Figure 1(b) shows a schematic drawing illustrating typical in-memory convolution execution within planar arrays using differential technique that requires matrix unfolding and input regeneration.

Figure 1(c) shows a schematic drawing illustrating a planar-staircase array that inherently shifts inputs, reduces input regeneration and parallelizes output generation, according to an example embodiment.

Figure 1(d) shows a schematic drawing illustrating the architecture of an accelerator with pipelining [9], Ex-IO IF: External IO interface.

Figure 1(e) shows a flowchart illustrating an in-memory compute methodology according to an example embodiment, ST: State Transformation.

Figure 1(f) shows a schematic drawing illustrating the procedure for the in-memory M2M methodology for neural networks, according to an example embodiment. Black boxes represent the matrix stored within arrays, the gray boxes represent the matrix applied as input pulses.

Figure 2(a) shows an SEM image of a fabricated sub-array for a 5x5 Kernel with 22 inputs and 18 outputs, according to an example embodiment.

Figure 2(b) shows the DC curve of planar-staircase Al2O3 RRAM devices according to example embodiments, over 50 cycles.

Figure 2(c) shows the cumulative probability distribution of set and reset voltages for 15 devices according to an example embodiment, over 50 cycles, showing a tight distribution. D2D: Device-to-Device, C2C: Cycle-to-Cycle.

Figure 2(d) shows 5x linear conductance modulation of 15 RRAM devices according to an example embodiment, over 100 reset pulses with low D2D variability (bars), where Current Compliance (CC) = 1 mA.

Figure 2(e) shows a comparison of a developed SPICE model with experimental data, showing good correlation, according to example embodiments.

Figure 3(a) relates to parasitic evaluation of an RRAM array according to an example embodiment, where technology node: 40 nm, Vread = 0.1 V, and the array is assumed to have copper routes; specifically, the effect of via and line parasitic resistance on the current flowing through staircase array outputs as a function of kernel size and outputs/AS. #AS: Kernel_columns; total outputs from the array = #AS x (Outputs/AS).

Figure 3(b) relates to parasitic evaluation of an RRAM array according to an example embodiment, where technology node: 40 nm, Vread = 0.1 V, and the array is assumed to have copper routes; specifically, the effect of via and line parasitic resistance on the current flowing through staircase array outputs as a function of kernel size and outputs/AS. #AS: Kernel_columns; total outputs from the array = #AS x (Outputs/AS).

Figure 3(c) relates to parasitic evaluation of an RRAM array according to an example embodiment, where technology node: 40 nm, Vread = 0.1 V, and the array is assumed to have copper routes; specifically, worst-case current flowing through array outputs as a function of #AS, Outputs/AS = 26.

Figure 3(d) relates to parasitic evaluation of an RRAM array according to an example embodiment, where technology node: 40 nm, Vread = 0.1 V, and the array is assumed to have copper routes; specifically, line delay as a function of #AS, Outputs/AS = 26.

Figure 3(e) relates to parasitic evaluation of an RRAM array according to an example embodiment, where technology node: 40 nm, Vread = 0.1 V, and the array is assumed to have copper routes; specifically, worst-case current as a function of kernel size for different layouts.

Figure 4(a) relates to M2M evaluation according to an example embodiment, where RRAMres = log2(RRAMstates), Pulseres = log2(Pulselevels), ES: Encoding Scheme, x: number of times a matrix has been split, Ux/Sx: unsigned/signed floating-point matrix convolution; specifically, output error (OE) for floating-point matrix convolution as a function of RRAM resolution.

Figure 4(b) relates to M2M evaluation according to an example embodiment, with the same notation as Figure 4(a); specifically, OE for floating-point matrix convolution as a function of input pulse levels for RRAMres = 1b, 6b.

Figure 4(c) relates to M2M evaluation according to an example embodiment, with the same notation as Figure 4(a); specifically, OE as a function of X, where X = f x max([Z]), [Z] is the matrix being split, and f = 0.25/0.5/0.75.

Figure 4(d) relates to M2M evaluation according to an example embodiment, with the same notation as Figure 4(a); specifically, ADCres as a function of RRAMres and contributing inputs, Pulseres = 3/6.

Figure 4(e) relates to M2M evaluation according to an example embodiment, with the same notation as Figure 4(a); specifically, ADCres as a function of RRAMres and contributing inputs, Pulseres = 3/6.

Figure 4(f) relates to M2M evaluation according to an example embodiment, with the same notation as Figure 4(a); specifically, system power consumed by planar-staircase arrays per convolution as a function of ES, where ES: RRAMres-Pulseres.

Figure 4(g) relates to M2M evaluation according to an example embodiment, with the same notation as Figure 4(a); specifically, a comparison of power per convolution for various ES with OE < 5%, where ES: Sx-RRAMres-Pulseres.

Figure 5(a) shows a 4-layer DCNN flowchart for MNIST[23] classification and different processes involved, according to an example embodiment.

Figure 5(b) shows MNIST [23] classification accuracy for a method according to an example embodiment vs GPU for a 3-layer DCNN with floating-point numbers, for different encoding schemes.

Figure 5(c) shows MNIST [23] classification accuracy comparison between the S1_4_3 scheme according to an example embodiment and GPU for different DCNNs (a 3-layer CNN and a 4-layer CNN). CN: Convolutional Layer; FC: Fully Connected Layer; SM: Softmax Layer.

Figure 6(a) shows the S1_4_3 ES analysis, specifically power consumed by the staircase array according to an example embodiment as a function of Outputs/AS, #AS = 26.

Figure 6(b) shows the S1_4_3 ES analysis, specifically area required by the staircase array according to an example embodiment as a function of Outputs/AS, #AS = 26.

Figure 6(c) shows the S1_4_3 ES analysis, specifically power consumed by the staircase array according to an example embodiment as a function of #AS.

Figure 6(d) shows the S1_4_3 ES analysis, specifically area required by the staircase array according to an example embodiment as a function of #AS.

Figure 6(e) shows the S1_4_3 ES analysis, specifically a comparison of power consumed by different layouts for the parallel output generation of a 28x28 image convolution with kernels, according to an example embodiment.

Figure 6(f) shows the S1_4_3 ES analysis, specifically a comparison of area consumed by different layouts for the parallel output generation of a 28x28 image convolution with kernels, according to an example embodiment.

Figure 7 shows a flowchart illustrating a method of fabricating a resistive random-access memory, RRAM, device for deep neural network, DNN, accelerators, according to an example embodiment.

Figure 8 shows a flowchart illustrating a method of convoluting a kernel [A] with an input feature map [B] in a memory device for a deep neural network, DNN, accelerator according to an example embodiment.

DETAILED DESCRIPTION

In an example embodiment, a hardware-aware co-designed system is provided that combats the above-mentioned issues and improves performance, with the following contributions:

• A planar-staircase array according to an example embodiment (Figure 1(c)).

• Combining the novel planar-staircase array (Figure 1(c)) with a hardware-aware in-memory compute method to design an accelerator (Figure 1(d)) that enhances peak power-efficiency.

• By reducing the number of devices connected to each input, the planar-staircase RRAM array according to an example embodiment alleviates I-R drop and sneak current issues to enable an exponential increase in crossbar array size compared to Manhattan arrays. The layout can be further extended to other emerging memories such as CBRAMs, PCMs.

• Eliminate input unfolding and reduce regeneration by performing convolutions through voltage application at the staircase-routed bottom electrodes and current collection from the top electrodes (Figure 1(c)). Power can be reduced by ~68% and area by ~73% per convolution output generation, compared to a Manhattan array execution.

• An in-memory Matrix-Matrix multiplication (M2M) method according to an example embodiment (Figures 1(e) and (f)) accounts for device and circuit issues to map arbitrary floating-point matrix values to finite RRAM conductances and can effectively combat device variability and nonlinearity. It can be extended to other crossbar structures/devices by replacing the circuit/device models.

• Using the conversion algorithm according to an example embodiment, the output error (OE) can be reduced to <3.5% for signed floating-point convolution with low device usage and input resolution.

• Irrespective of the number of kernels operating on each image, an example embodiment can process the negative floating-point elements of all the kernels within 4 RRAM arrays using the M2M method according to an example embodiment. This reduces the device requirement and power utilization.

• The hardware-aware system according to an example embodiment achieves >99% MNIST classification accuracy for a 4-layer DNN using a 3-bit input resolution and 4-bit RRAM resolution. An example embodiment improves power-efficiency by 5.1x and area-efficiency by 4.18x over state-of-the-art accelerators.

Convolutional Neural Network (CNN) Basics

DNNs typically consist of multiple convolution layers for feature extraction followed by a small number of fully-connected layers for classification. In the convolution layers, the output feature maps are obtained by sliding multiple 2-dimensional (2D) or 3-dimensional (3D) kernels over the inputs. These output feature maps are usually subjected to max pooling, which reduces the dimensions of the layer by combining the outputs of neuron clusters within one layer into a single neuron in the next layer. A cluster size of 2x2 is typically used and the neuron with the largest value within the cluster is propagated to the next layer. Max-pool layer outputs, subjected to activation functions such as ReLU/Sigmoid, are fed into a new convolution layer or passed to the fully-connected layers. Equations for convolution of x input images ([B]) with kernels ([A] of size m x n) and subsequent max-pooling with a cluster size of 2x2 to obtain output [C] are given below:
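Since the equations themselves are not reproduced in this text, the following Python sketch illustrates the two operations being described, a valid-mode 2D convolution followed by 2x2 max pooling (sizes and variable names are illustrative only):

import numpy as np

def conv2d_valid(B, A):
    # Slide an m x n kernel A over input feature map B (valid padding).
    m, n = A.shape
    H, W = B.shape
    out = np.zeros((H - m + 1, W - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(B[i:i + m, j:j + n] * A)
    return out

def maxpool2x2(X):
    # 2x2 max pooling: keep the largest value within each 2x2 neuron cluster.
    H, W = (X.shape[0] // 2) * 2, (X.shape[1] // 2) * 2
    return X[:H, :W].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

C = maxpool2x2(conv2d_valid(np.random.rand(28, 28), np.random.rand(3, 3)))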

In an example embodiment, the focus is on the acceleration of the inference engine where the weights have been pre-trained. Specifically, an optimized system for efficient convolution layer computations is provided according to an example embodiment, since they account for more than 90% of the total computations.

RRAM-based In-memory Computation

Previously reported in-memory vector-matrix multiplication techniques store weights of the neural network as continuous analog device conductance levels and employ pulse-amplitude modulation for the input vectors to perform computations within the RRAM array (Figure 1(b)). Upon voltage pulse application at the word-line inputs, Ohm's and Kirchhoff's laws determine the current flowing through each bit-line. Sense amplifiers (SAs) combined with basic hold circuits convert the bit-line current to voltage and hold the analog output to enable Analog-to-Digital Converter (ADC) sharing to save computational power. ADC outputs obtained after converting the crossbar's voltage outputs to digital signals are mapped back to floating-point elements using non-linear map-back functions. However, such execution increases peripheral overheads and results in high susceptibility to noise. An example embodiment aims to reduce the periphery and improve the robustness of the system.
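A minimal sketch of the amplitude-modulated crossbar computation described above (illustrative only; the conductance range and array size are assumptions): word-line voltages V are applied to a crossbar whose conductances G store the weights, and by Ohm's and Kirchhoff's laws each bit-line collects the dot product of V with its column of G.

import numpy as np

G = np.random.uniform(1e-6, 1e-4, size=(81, 64))  # programmed conductances (S)
V = np.random.uniform(0.0, 0.1, size=81)          # word-line read voltages (V)
I_bitlines = V @ G                                # per-bit-line currents (A)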

Planar Staircase Array according to an example embodiment

As mentioned above, most reported works use a 2D-planar layout (Manhattan layout) that requires matrix unfolding into vectors and massive input regeneration (Figure 1(b)) for convolution operations. To eliminate these issues and increase input feature map reuse, a planar RRAM array 100 with staircase routing for the bit-lines e.g. 102, which constitute the bottom electrode layer (Figure 1(c)), is provided. In the layout according to an example embodiment, each bit-line e.g. 102 gets connected to one or more RRAM cells e.g. 104, 106 along different levels of the array 100 storing different kernel elements, based on the outputs each input signal contributes to. In other words, at least a portion of the bit-lines e.g. 102 are staggered such that a location of a first cross-point between the bit-line e.g. 102 and a first word-line e.g. 105 (i.e. RRAM cell 104) is displaced along a direction of the word-lines compared to the cross-point between the bit-line e.g. 102 and a second word-line e.g. 103 adjacent the first word-line e.g. 105 (i.e. RRAM cell 106). In the example embodiment, the RRAM cells e.g. 104, 106 are programmed by applying programming pulses to the word-lines e.g. 103, 105 in the top electrode layer.

The staircase routing for the bit-lines e.g. 102 results in the auto-shifting of inputs and facilitates the parallel generation of convolution outputs with minimal input regeneration. From Figure 1(c), it can be observed that the output generation using the layout according to an example embodiment does not require matrix unfolding, as each sub-array e.g. 112 is configured to take inputs from the same row of the input matrix, e.g. b31-b35, and to have the elements of a row of a kernel (e.g. a31, a32, and a33) applied in the DNN accelerator contributing to the output. This leads to lower pre-processing time.
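The following Python sketch is an illustrative model of this mapping (not the physical layout): with the staircase routing, the device holding kernel element a[k] in the sub-array for output j effectively sits on the bit-line driven by input b[j + k], so each word-line accumulates one convolution output without unfolding the input row into a vector.

import numpy as np

def staircase_row_conv(b_row, a_row):
    # One sub-array: output j sums a[k] * b[j + k] over the kernel row, i.e.
    # the staggered placement performs the input shift that im2col would need.
    n_out = len(b_row) - len(a_row) + 1
    outputs = np.zeros(n_out)
    for j in range(n_out):                 # one word-line (output) per j
        for k, a in enumerate(a_row):      # devices along the staggered bit-lines
            outputs[j] += a * b_row[j + k]
    return outputs

print(staircase_row_conv(np.array([1., 2., 3., 4., 5.]), np.array([0.5, 1.0, 0.5])))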

Fabrication and Electrical Characterization according to an example embodiment

The lack of complex algorithms to map kernel elements to RRAM device locations according to an example embodiment reduces mapping complexity. After programming the RRAM cells e.g. 104 (Figures 1(c) and 2(a)) based on kernel values, voltage pulses are applied to the bit-lines e.g. 102 with duty cycle/width based on input matrix values. Current flowing through each word-line e.g. 103 in the top electrode layer over the processing time gets integrated and converted to digital signals in the analog to digital converter and sense amplifier, ADC/SA 120. A linear transformation applied to these digital signals generates the floating-point output matrix elements.

In an array according to an example embodiment, the RRAM cells e.g. 106 comprise an Al2O3 switching layer contacted by the bit-lines e.g. 102 at the bottom and the word-lines e.g. 103 at the top. The array 100 is fabricated by first defining the bottom electrode layer with the staircase bit-line (e.g. 102) layout via lithography and lift-off of 20 nm/20 nm Ti/Pt deposited using an electron beam evaporator. Following this, a 10 nm Al2O3 switching layer is deposited using atomic layer deposition at 110°C. The top electrode layer with the word-lines e.g. 103 is subsequently defined using another round of lithography and lift-off of 20 nm/20 nm Ti/Pt deposited via electron beam evaporator. The final stack of each cell e.g. 106 fabricated in the array is Ti/Pt/Al2O3/Ti/Pt. Figure 2(a) shows the SEM image of an Al2O3 staircase array 220 according to an example embodiment.

It is noted that in various example embodiments the switching layer comprises Al2O3, SiO2, HfO2, MoS2, TaOx, TiO2, ZrO2, ZnO, etc., at least one of the bottom and top electrode layers comprises an inert metal such as Platinum, Palladium, Gold, Silver, Copper, Tungsten, etc., and at least one of the bottom and the top electrode layers may comprise a reactive metal such as Titanium, TiN, TaN, Tantalum, etc.

The RRAM DC-switching characteristics from the Al2O3 staircase array 220 according to an example embodiment show non-volatile gradual conductance reset over a 10x conductance change across a voltage range of -0.8 V to -1.8 V (Figure 2(b)). The cumulative distribution plot of the Set/Reset voltages for 15 RRAM devices over 50 cycles (Figure 2(c)) shows a tight distribution, implying low device-to-device and cycle-to-cycle variability. Figure 2(d) confirms that the conductance curve of multiple fabricated RRAM devices according to an example embodiment as a function of 100 reset pulses demonstrates a 5x linear reduction. Here, the conductance curve is divided into 8 states (S0-S7) based on the observed device variability. For system analysis, a hysteron-based compact model, developed by Lehtonen et al., has been calibrated to the Al2O3 RRAM according to an example embodiment. Figure 2(e) shows the HSPICE compact-model behavior for the RRAM according to an example embodiment, which demonstrates a good correlation with the experimental data. In addition, guided by the current variation displayed in Figure 2(d), a σ/μ of 0.2 was added to the RRAM current at each state to account for the device-to-device and cycle-to-cycle variability. Due to the above measures, the simulations performed according to an example embodiment account for the various RRAM device issues and provide an accurate estimate of the output error.

The RRAM according to an example embodiment is fully compatible with CMOS technology in terms of both the materials and the processing techniques employed, with low temperatures (<120°C) suitable for back-end-of-line (BEOL) integration. The Al2O3-RRAM device according to an example embodiment is almost forming-free, implying that there is no permanent damage to the device after initial filament formation, and does not limit the device yield. Therefore, the Al2O3 RRAM devices according to an example embodiment can be easily scaled down to the sub-nm range. It is noted that the arrays fabricated at a larger node in an example embodiment are used to evaluate the efficacy of the layout and the proposed in-memory compute schemes, and can be replaced with other compatible materials at lower nodes.

Modifications according to example embodiments

It is noted that in different example embodiments, the lines in the top electrode layer can be staggered and function as the bit-lines, and the current can be collected from straight lines in the bottom electrode layer functioning as the word-lines. Also, it is noted that in different example embodiments, the word-lines can be staggered instead of the bit-lines. Further, the RRAM devices used in the example embodiment described can be replaced, and the layout can be extended to other memories capable of in-memory computing in different example embodiments, including, but not limited to, Phase-Change Memory (PCM) and Conductive Bridging RAM (CBRAM), using materials such as, but not limited to, GeSbTe and Cu-GeSex.

Array Size Evaluation according to an example embodiment

With reference again to Figure 1(c), the complete array 100 layout according to an example embodiment for convolution execution comprises multiple sub-arrays e.g. 112, with staircase bottom electrode routing. Multiple such sub-arrays contributing to the same outputs constitute an Array-Structure (AS) e.g. 114, and numerous such AS e.g. 114 sharing bottom electrodes form the array 100. Consecutive AS e.g. 114, 116 are flipped versions of each other and connected using staircase routing. Such connections further reduce input regeneration and result in a multifold improvement in performance. In an example embodiment, the staircase array uses 3 metal layers, the BE, TE and a metal layer beneath the BE layer, to enable connection of the intermediate inputs (e.g. the inputs for b14, b15, b24, b25, b34, b35) to external CMOS circuits, here to external DACs 117, 118, 119. For an AS e.g. 114 with x outputs, each sub-array e.g. 112 takes t1+t2 = x-1+r pulse inputs, resulting in a total of x x n array outputs. Furthermore, for outputs x > r+1, the total number of pulse inputs to the array = (t1+t2) x (n+n-1+(0.5 x (n-1) x (n-1))); while for x < r+1, the total number of DAC inputs to the array = ((r+x-1) x (n+n-1)) + ((x-1) x (n-1) x (n-1)). Here, r is the number of kernel rows (Kernel_rows) and n is the number of kernel columns (Kernel_columns); in this layout, the number of AS in the array (#AS) also equals n. A direct transcription of these counts into a small helper function is given below.
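The sketch below (illustrative only) transcribes the counts quoted above directly into Python; the symbols are as defined in that paragraph, and the formulas follow the text as reproduced here rather than an independently verified derivation.

def staircase_array_counts(r, n, x):
    # r: kernel rows, n: kernel columns (= #AS in this layout), x: outputs per AS
    inputs_per_subarray = x - 1 + r                    # t1 + t2
    total_outputs = x * n
    if x > r + 1:
        total_inputs = inputs_per_subarray * (n + n - 1 + 0.5 * (n - 1) * (n - 1))
    else:
        total_inputs = (r + x - 1) * (n + n - 1) + (x - 1) * (n - 1) * (n - 1)
    return total_outputs, total_inputs

print(staircase_array_counts(r=3, n=3, x=4))           # 3x3 kernel, x = r + 1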

An increase in the routing length due to the staircase routing according to an example embodiment would result in larger line parasitic resistances and capacitances in the array. Hence, the effect of an increase in the outputs/AS (x) on the current was evaluated using HSPICE and the results are shown in Figures 3(a) and (b). For this evaluation, the line resistance was extracted to be 1 Ω between adjacent tracks and the via resistance to be 55 Ω from the full layout design according to an example embodiment in Cadence, and was cross-checked with previous works. Considering the filamentary nature of the Al2O3 RRAM device switching (Figure 2(e)), the increase in resistance resulting from the elimination of leakage currents with scaling was neglected, and the power/I-R drop analyses were based on the measurements at the 2x2 μm² range. It is noted that the area of the devices has been scaled based on the metal pitch at the 40 nm technology node. From the device data according to an example embodiment, the resistances for different states at Vread = 0.1 V (CC = 100 μA) were derived for the analysis. As described above, an increase in the number of outputs requires an increase in the number of inputs connected to each sub-array. The larger number of inputs leads to increased route lengths within each sub-array and between consecutive AS, resulting in the observed variation. A rise in outputs from 1 to ceil(r/2)+1 sees an increment in devices connected to the input lines accompanied by a surge in line parasitics, leading to a drop in system current (Figures 3(a) and (b)). Here, ceil(x) is the integer closest in value to x that is >= x. Beyond this threshold, the inputs shared between inconsecutive AS decrease, reducing the current degradation. However, a precipitous surge was observed in AS routing beyond outputs = r+1, which leads to an exponential drop in system current. Owing to this trend, the optimum number of outputs per AS is r+1 for kernels with size >7x7, according to an example embodiment. As each array is a union of multiple AS sharing inputs, it is important to understand the impact of an increase in the number of AS on the system performance according to an example embodiment. For this evaluation, an array for a 3x3 kernel with 26 outputs per AS was considered and the results are shown in Figures 3(c) and (d). Beyond outputs/AS (x) = r+1, an increase in AS neither alters the input route length nor the current. Hence, no significant current drop was observed with an increase in AS. This property can be exploited according to an example embodiment to build dense arrays generating many outputs to improve throughput and decrease input regeneration.

Furthermore, the staircase array output current according to an example embodiment was compared with that of the Manhattan and staggered-3D arrays in Figure 3(e). For the Manhattan array layout, inputs to the array = n x r, outputs = (r+1) x n. For the staircase array according to an example embodiment, inputs per sub-array (t1+t2) = 2 x r, outputs (x x n) = (r+1) x n, #AS (n) = n. Descriptions of the different variables remain unchanged. For the staggered-3D array, total outputs = (2r+1) x n, total inputs = (3r-1) x n, RRAMs connected to each output = r, and the total current shown in Figure 3(e) = (current per output) x n. It was assumed that all the considered layouts have copper interconnects, which results in a via resistance of 55 Ω and a line resistance of 1 Ω between adjacent tracks. From Figure 3(e), it can be inferred that though the longer routes increase line parasitics in planar-staircase arrays according to an example embodiment, the lower number of devices connected to each input leads to a lower current flow, thus reducing the I-R drop. This reduction makes the planar-staircase array according to an example embodiment more resilient to line parasitics compared to the Manhattan array. Though the staggered-3D array leads to lower I-R drop and sneak current issues owing to its high via resistance, it reduces the system performance owing to larger periphery requirements, as will be discussed in more detail below.

In-Memory M2M Execution according to an example embodiment

While neural networks mandate low quantization error (QE) and high accuracy, the RRAM states (a minimum of 6-bit) required to achieve this are difficult to demonstrate on a single device. RRAM device variability further exacerbates the issue. Hence, in an example embodiment, an M2M method was delineated that achieves high output accuracy with low input resolution while combating device issues and improving throughput. To tolerate device nonlinearity and reduce interface overheads, pulse-width modulation was employed instead of amplitude modulation to represent the input vectors (Figure 1(c)). Furthermore, to develop a system resilient to device variations, the RRAM device conductance was discretized according to an example embodiment closer to the more stable low-resistance state (LRS), based on device variability. Matrix A/kernel ([A]) elements are mapped onto one of the device conductance states while input voltage pulses with pulse widths based on matrix B/input feature map ([B]) are applied to the word-lines according to an example embodiment, as depicted in Figure 1(f).
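A minimal sketch of the pulse-width-modulated accumulation just described (illustrative only; the read voltage, conductance range, pulse unit and array size are assumptions): each input is encoded as the duration of a fixed-amplitude read pulse, so the charge integrated on an output line is a dot product that does not depend on the device's current-versus-voltage nonlinearity.

import numpy as np

V_READ = 0.1                                   # V, fixed read amplitude
G = np.random.uniform(1e-6, 1e-4, (81, 64))    # programmed conductances (S)
t = np.random.randint(0, 8, 81) * 10e-9        # pulse widths (s), 3-bit inputs assumed
Q = (t * V_READ) @ G                           # integrated charge per output line (C)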

M2M methodology according to an example embodiment

To facilitate the processing of signed floating-point numbers using a single array according to an example embodiment, the input matrices are split into two substituent matrices:

Thus, the output feature map, [C], becomes:

Here min([X]) represents the minimum among the elements of [X]; [U1] is an a x b dimension matrix with all its elements equal to abs(min([A])) and [U2] is an n x t matrix with each of its elements equal to abs(min([B])). Here, abs(X) gives the absolute value of X, and l1 = Sign(min([A])), l2 = Sign(min([B])). Although this transformation results in four matrices [A1], [B1], [U1], [U2] from the original [A] and [B], every element of the resultant matrices is non-negative, making it possible for them to be processed using a single crossbar array. Furthermore, the range of elements in [A] remains unaltered in [A1], while [U1] enables the processing of negative floating-point numbers of the input kernels. Similarly, [B1] preserves the range of [B] while [U2] helps process its negative elements. It has been reported that a 6-bit resolution is required to achieve an output degradation of <6%. However, the demonstration of 64 low-variability states within each RRAM is difficult. Hence, a new methodology was developed according to an example embodiment that lowers the RRAM state requirements by splitting the resultant matrices further (Figure 1(f)). To execute high-accuracy computations with low RRAM states, the resultant matrix [A1] is split into two matrices:

Based on (3), max([A1]) = max([A]) + abs(min([A])). The above split generates two matrices [M1] and [M2], each with a lower element range compared to [A1]: 0 <= M1,ij <= max([A1]) - X and 0 <= M2,ij <= X. Lowering the range of the individual matrices reduces the quantization step, thereby reducing QE. Furthermore, [U1] is split into [M3], [M4] as in (8), to reduce the effect of device non-linearity on the output:
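Because the numbered transformation and split equations referenced above are not reproduced in this text, the Python sketch below shows one decomposition consistent with the stated properties ([A1] preserves the range of [A], [U1] is constant at abs(min([A])), and the [M1]/[M2] ranges quoted above); the clipping rule used for the split is therefore an assumption, not the patented formula.

import numpy as np

def correlate2d_valid(B, A):
    # Plain 2D 'valid' sliding-window correlation, used here as the convolution.
    m, n = A.shape
    H, W = B.shape
    out = np.zeros((H - m + 1, W - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(B[i:i + m, j:j + n] * A)
    return out

def split_signed(X):
    # X = X1 + sign(min(X)) * U, with X1 non-negative (same range as X) and
    # U a constant matrix whose every element equals abs(min(X)).
    mn = X.min()
    return X - mn, np.full_like(X, abs(mn)), np.sign(mn)

A = np.random.uniform(-1, 1, (3, 3))      # kernel with signed elements
B = np.random.uniform(0, 1, (8, 8))       # input feature map

A1, U1, l1 = split_signed(A)
B1, U2, l2 = split_signed(B)              # for ReLU inputs min([B]) = 0, so [U2] vanishes

# The output feature map [C] expands into four all-non-negative convolutions:
C = (correlate2d_valid(B1, A1) + l2 * correlate2d_valid(U2, A1)
     + l1 * correlate2d_valid(B1, U1) + l1 * l2 * correlate2d_valid(U2, U1))
assert np.allclose(C, correlate2d_valid(B, A))

# Assumed clipping split of [A1] at threshold X, satisfying the quoted ranges
# 0 <= M1 <= max([A1]) - X and 0 <= M2 <= X:
X = A1.max() / 2
M2 = np.minimum(A1, X)
M1 = A1 - M2
assert np.allclose(M1 + M2, A1)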

After the split, the elements of the thus-derived matrices are mapped to device conductance states, for in-memory computation, using the quantization step (Δx).

State Matrix Derivation:

Post the matrix split detailed in (6)-(8) above, the elements of the derived matrices are mapped to device conductance states, for in-memory computation, using the quantization step (Δx):

In the above equations, abs(X) is the absolute value of X, [Mx] are the matrices derived from splitting [A1], [M3/4] are the matrices derived from splitting [U1], 0 < X < max([A1]), and R1, R2, R3 are the numbers of RRAM conductance states used for processing [M1], [M2] and [M3/4], respectively. Similarly, the derived matrices of [B] ([B1] and [U2]) are mapped to input pulse widths using the quantization step Δ2, derived as: Here, m is the number of levels the input pulse has been divided into. Using this quantization step, we map the elements of the derived matrices [M1/2/3/4] to RRAM conductance states and [B1]/[U2] to input pulse levels. The two state matrices ([SZt,x] and [SZt,y]) of each of the derived matrices are determined as: When [Zt] = [M1], ΔM1 is used. For [M2], ΔM2 is used, and Δ2 is used when [Zt] = [B1]/[U2], for the state transformation. Such mapping of each element to two RRAM devices lowers the output QE and combats device-variability issues. The elements of [M3] and [M4] are mapped as:

Due to the above transformation, independent of abs(min([A])), every element of the state matrices of [M3] and [M4] gets mapped to 0 and R3-1, respectively. Thus, irrespective of the number of kernels operating on an input matrix, the [SM3,x/y] and [SM4,x/y] elements need to be stored and processed just once per input matrix. Each element of [SM1,x/y], [SM2,x/y], [SM3,x/y], [SM4,x/y] represents one of the RRAM conductance states (s0-smax). Based on [SB1,x], [SB1,y], [SU2,x], [SU2,y], the read pulse width applied to each word-line is determined as:
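The quantization-step and state-assignment equations themselves are not reproduced above; the sketch below is a plausible floor/ceiling assignment consistent with the description (each element maps to two state-matrix entries, and the input-side matrices map to pulse levels the same way). The step definitions and the 10 ns unit pulse are assumptions.

import numpy as np

def state_matrices(Z, z_max, n_states):
    # Map matrix Z (elements in [0, z_max]) to a 'floor' and a 'ceiling' state
    # matrix; the quantization-step definition is an assumption.
    delta = z_max / (n_states - 1)
    s_floor = np.clip(np.floor(Z / delta), 0, n_states - 1).astype(int)
    s_ceil = np.clip(np.ceil(Z / delta), 0, n_states - 1).astype(int)
    return s_floor, s_ceil, delta

M1 = np.random.uniform(0, 1, (3, 3))   # stand-ins for the derived kernel matrices
M2 = np.random.uniform(0, 1, (3, 3))
B1 = np.random.uniform(0, 1, (8, 8))   # stand-in for the derived input matrix

# Kernel-side matrices go to RRAM conductance states (R1/R2 states each) ...
SM1x, SM1y, dM1 = state_matrices(M1, M1.max(), 16)
SM2x, SM2y, dM2 = state_matrices(M2, M2.max(), 16)
# ... while [B1]/[U2] go to input pulse levels, i.e. read pulse widths.
SB1x, SB1y, d2 = state_matrices(B1, B1.max(), 8)
pulse_widths_x = SB1x * 10e-9          # seconds, assuming a 10 ns unit pulse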

Upon state matrix determination, the RRAM arrays are programmed based on the kernel's state matrices. The [SB1,x]/[SU2,x] elements are applied to RRAMs storing [SMj,x] elements and [SB1,y]/[SU2,y] are applied to [SMj,y] elements (j = 1, 2, 3, 4) (Figure 1(e)). Current flowing through the bit-lines, integrated over the processing time, is converted to digital signals using an ADC. The output feature map ([C]) given by (5) above, which is the convolution output of [A] and [B], is derived as:

Each of the components of S(6) is obtained as:

In S(7), the convolution of the derived [M] matrices with [B1]/[U2] is carried out within the staircase array. ADC outputs, obtained after converting the integrator outputs to digital signals, are transformed into floating-point numbers using the below equation:

where Vit/jt is the voltage accumulated at the integrator output, c is the intercept of the RRAM conductance line, m is the slope of the line representing the RRAM conductance, Capj is the capacitance associated with the integrator circuit, and τp = Total Pulse Width/(m-1).

For neural networks using activation functions such as ReLU/Sigmoid, min([B]) = 0, thus resulting in U2,ij = 0. Assuming that [A1] and [U1] have been split into two matrices each, S(6) evolves into S(10) for neural networks:

In the method according to an example embodiment, independent of abs(min([A])), every element of the state matrices of [M3] and [M4] gets mapped to 0 and R3-1, respectively. Thus, irrespective of the number of kernels operating on an input matrix, the [M3] and [M4] state matrix elements need to be stored and processed just once per input matrix.

Upon state matrix determination, the RRAM arrays are programmed based on the kernel's state matrices, while the state matrices of [B1]/[U2] determine the pulse widths applied to the word-lines (Figure 1(e)). Current flowing through the bit-lines, integrated over the processing time, is converted to digital signals using an ADC. Derivation of the output feature map ([C]) given by (5), which is the convolution output of [A] and [B], requires a linear transformation as detailed above. The lack of complex functions to map the ADC outputs back to floating-point numbers according to an example embodiment further reduces the power consumed by the digital circuits of the accelerators.

Also, the split of [A1] and [U1] lowers QE considerably due to the reduction in the element range of the resultant matrices according to an example embodiment.

Quantization Error Calculation: Consider an element ax ∈ [A], with min([A]) < 0, and bi ∈ [B]. Here, ax can be split into a1 and t1 as ax = a1 - t1, where a1 ∈ [A1] and t1 = abs(min([A])). Assuming min([B]) = 0, n2 = floor(bi/Δ2) and (n2+1) = ceil(bi/Δ2), we get:

In S(11), Hence,

The value of t1 x bi is calculated using the proposed method as:

The QE incurred at the output due to such mapping is:

Similar to S(14), one can calculate the QE for the multiplication of a1 with bi. But, unlike t1, there are two possibilities for a1.

Case 1:

In S(15), The output of in-memory multiplication between a1 and bi is given by S(16) while the ideal output is given in S(17):

Using S(16)-S(17) and calculating, we get: Since a1 < X, its corresponding element in [M1] is 0 and hence Vb = Ib - Sb = 0. Hence, the final QE for ax x bi can be derived as:

Substituting one gets:

Case 2:

For this case, ax can be rewritten as:

Similar to S(14), QE for the multiplication of X and bi can be derived as:

QE for and is given as:

In S(13), Substituting S(14), S(22) and S(23) in S(19), one gets:

Substituting one gets: Here, the expected output (T) is obtained using the RRAM crossbar array and the quantization error per multiplication is given by Vx. By using both floor and ceiling state matrices for computation, one reduces the quantization error and makes it symmetric about 0.

To minimize the QE in S(20) and S(25) simultaneously, one needs to make X = t1. When X = max([A1])/2 and for a distribution with max([A]) = abs(min([A])), with R1 = R2 = R3 = R, one gets:

Without the matrix split given above, the resultant QE [8] is:

In the above equations, the two step sizes are those for [A] and [B], respectively. Comparing S(20) and S(25) with S(28), one sees that the split of [A1] and [U1] lowers the QE considerably.

Owing to this reduction, a lower number of RRAM states and pulse levels can be used for high-accuracy computations when [A1] is split. For applications requiring higher accuracy, [M1/2] can be further divided using the equations (6)-(7), to reduce QE, according to an example embodiment. As all elements of the derived matrices are non-negative, no changes to [M3/4], which deal with the negative floating-point elements, are made.

As every element of the state matrices of [M3/4] equals either 0 or R3-1, a further split of these matrices is not required to achieve a lower QE. Further, it is seen that mapping each element of the resultant matrices of [A] and [B] to two state matrix elements lowers the output QE and makes it symmetric about 0. Such QE minimization increases output accuracy and enables the usage of lower RRAM resolution for high-accuracy computations.
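The following short numerical check (illustrative only, using the assumed clipping split from the earlier sketch) reproduces the qualitative claim above: quantizing a matrix after splitting it at X = max/2 into two half-range matrices roughly halves the quantization step and the resulting error.

import numpy as np

rng = np.random.default_rng(0)
A1 = rng.uniform(0, 2, (9, 9))                 # non-negative matrix, range [0, 2]
R = 16                                         # RRAM states per device (assumed)

def quantize(Z, z_max, n_states):
    d = z_max / (n_states - 1)
    return np.round(Z / d) * d

X = A1.max() / 2
M2 = np.minimum(A1, X)                         # assumed clipping split (see above)
M1 = A1 - M2
err_unsplit = np.abs(quantize(A1, A1.max(), R) - A1).mean()
err_split = np.abs(quantize(M1, X, R) + quantize(M2, X, R) - A1).mean()
print(err_unsplit, err_split)                  # the split error is the smaller one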

Performance evaluation of an example embodiment

Figure 2(e) shows the HSPICE compact-model behavior for the Al2O3 RRAM according to an example embodiment, which represents the experimental data well. A software-based memory controller unit, written in Python and interfaced with MATLAB-coded compact RRAM models, emulated the planar-staircase array according to an example embodiment for all aspects of the system simulation. To begin with, the variation in output error (OE) with RRAM states and input pulse levels was analysed. The effect of splitting the matrices into multiple parts on the OE was also evaluated. For this analysis, a 100x100 input ([B]) and a 9x9 kernel were considered. Two sets of simulations, one with the matrix elements chosen at random from the interval [0,1] and the other from [-1,1], were performed, with 300 test cases for each unique combination of RRAM resolution and pulse levels. Figures 4(a) and (b) show the OE incurred as a function of RRAM resolution and input-pulse levels. Here, OE is derived as:
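The OE formula itself is not reproduced above; a plausible stand-in (an assumption, not the patented definition) is the mean absolute deviation of the in-memory result from the ideal convolution, normalised to the largest ideal output and expressed as a percentage:

import numpy as np

def output_error(C_mem, C_ideal):
    # Assumed definition of the output error (OE), in percent.
    return 100.0 * np.mean(np.abs(C_mem - C_ideal)) / np.max(np.abs(C_ideal))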

While Figure 4(a) delineates the effect of varying the RRAM resolution on the error, Figure 4(b) reports the impact of varying the pulse resolution for two different RRAM resolutions. In accordance with S(23), Figures 4(a) and (b) show that an increase in RRAM resolution and pulse levels reduces the OE due to the increase in the number of available bins and the lower quantization step. Also, splitting the resultant matrices of [A1] further decreases the OE due to the reduced range of the final matrices, thus reducing the quantization step. The lowered range of the resultant matrices enables the usage of a lower resolution for similar output accuracy. For input image and kernel elements with all-positive elements, the OE is ~0.3%, while it is <3.4% for matrices with signed floating-point elements. Comparing the OE for the split lower-resolution computations with unsplit high-resolution computations according to example embodiments shows that splitting the matrices results in lower OE (Figures 4(a) and (b)).

As can be seen from S(20) and S(25), the value at which each matrix gets split into subsequent matrices (X) plays a crucial role in determining the OE. Hence, the effect of the matrix split at different values of X on the OE was analyzed and the results documented in Figure 4(c). Kernel and input sizes remain unchanged from the previous analysis, with elements drawn at random from the interval [-1,1] for the kernel and [0,1] for the input. As in the previous simulations, 300 test cases for each combination of RRAM and pulse resolutions were considered. From Figure 4(c) one observes that when X = max([A1])/2, the equal element range of the resultant matrices leads to the minimal error. The trend remains unchanged for the three considered combinations of RRAM and pulse resolutions.
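
The following generic numpy sketch illustrates why X = max([A1])/2 is favourable: splitting a matrix about a threshold X into a clipped part and a remainder (an assumed decomposition used only for illustration, not the embodiment's split equations) gives two all-positive matrices whose ranges, and hence quantization steps, are balanced when X is half the maximum.

    import numpy as np

    rng = np.random.default_rng(1)
    A1 = rng.uniform(0.0, 1.0, size=(9, 9))          # all-positive matrix to be split

    def split_at(A, X):
        """Split A into a part clipped at X and the remainder above X (both all-positive)."""
        low = np.minimum(A, X)                        # element range [0, X]
        high = np.maximum(A - X, 0.0)                 # element range [0, max(A) - X]
        assert np.allclose(low + high, A)             # the two parts reconstruct A exactly
        return low, high

    R = 16                                            # assumed number of RRAM states per array
    for X in (0.25 * A1.max(), 0.5 * A1.max(), 0.75 * A1.max()):
        low, high = split_at(A1, X)
        worst_step = max(low.max(), high.max()) / (R - 1)
        print(f"X = {X:.3f} -> worst-case quantization step = {worst_step:.4f}")

With X = max([A1])/2 the two resultant matrices have equal ranges, so the worst-case quantization step, and with it the QE, is minimized, consistent with the trend observed in Figure 4(c).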

Following the thorough evaluation of the impact of the various parameters on the output accuracy according to example embodiments, the impact of these parameters on the system power was assessed using planar-staircase arrays according to an example embodiment with 120 outputs. As the ADC and Digital-to-Analog Converter (DAC) account for ~90% of any DNN system power, the minimum ADC resolution required was evaluated as a function of array size, RRAM states, and pulse resolution (Figures 4(d) and (e)). In these graphs, the inputs represent the number of RRAMs contributing to each output. The minimum ADC resolution required to limit OE degradation to <2% for different combinations of RRAM resolution, pulse levels, and contributing RRAMs is presented in these figures. Using the above result, the power required per convolution was evaluated for different encoding schemes (Figure 4(f)). For this analysis, a planar-staircase array according to an example embodiment with 120 outputs (12 AS, ten outputs/AS), with each output connected to 81 RRAM devices, was considered. Complete utilization of the array was assumed. The resultant matrices derived from a matrix split are stored on separate arrays. Furthermore, owing to the ceil and floor state matrices, each resultant matrix is stored in two separate arrays. Power and area estimates for the 1-bit DAC functioning at 0.1V according to an example embodiment are documented in Table 1.

TABLE 1

Power and Area of different components

COMPONENT         PROPERTIES                       POWER      AREA
RRAM              0.1 V Vread / 16 states          16.34 nW   0.0081 µm²
DAC               1-bit / 70 MHz                   1.61 µW    0.166 µm²
ADC               8-bit / 1.2 GHz                  2 mW       0.0012 mm²
SA                -                                77.5 nW    0.0391 µm²
Multiplier        16-bit / 1.89 GHz                0.188 mW   0.002612 mm²
Adder             16-bit / 40 MHz                  1.703 µW   16.5 µm²
Maxpool           -                                0.4 mW     0.00024 mm²
ReLU              -                                0.2 mW     0.0003 mm²
Input Register    2 KB                             1.24 mW    0.0021 mm²
Output Register   2 KB                             1.12 mW    0.0021 mm²
eDRAM             64 KB / 4 banks / 256 bus width  20.7 mW    0.083 mm²
eDRAM-to-IM       384 wires                        7 mW       0.09 mm²
Router            -                                10.5 mW    0.03775 mm²
Hyper tile        -                                10.4 W     22.88 mm²
Cycle time        100 ns                           -          -

In Figure 4(f), an increase in pulse levels is observed to increase the system power according to an example embodiment, due to the higher ADC resolution and DAC operating frequency. Likewise, an increase in the RRAM states increases ADC & RRAM power consumption. Also, the greater the matrix split, the greater the number of ADC accesses, which leads to higher power consumption. Preferably, these factors are considered while designing an optimal system according to an example embodiment capable of achieving high output accuracy with minimal power consumption. To emphasize this, the power consumption of the different encoding schemes needed to achieve similar output accuracy is compared in Figure 4(g). One observes that the S1_5_6 encoding scheme uses the least power for comparable OE among the considered encoding schemes.

Accelerator design according to an example embodiment

Neural network implementation according to an example embodiment

Following the evaluation of the in-memory compute methodology according to an example embodiment, DNNs were implemented using the co-designed system according to an example embodiment. A visual depiction of a 4-layer DNN 500, with all the involved processes and the system architecture, is given in Figure 5(a). For neural networks, the activation functions used (ReLU, sigmoid) result in min([B])≥0. In addition, kernel weights can be represented by a Gaussian distribution with a mean of 0. Thus, min([A])<0 and hence sign(min([A]))=-1. Substituting sign(min([A]))=-1 and sign(min([B]))=0 in (5) and using X=max([A1])/2, we get:

In the above equation, V: the voltage accumulated at the integrator output; Δ1/Δ3: the quantization steps of [M1]/[M3]; Δ2: the quantization step of the input image; Bx_i,j/By_i,j: the i-th row and j-th column elements of the state matrices of the input image.

For neural networks with an ideal Gaussian weight distribution, Δ1 ≈ Δ3, which justifies neglecting the terms involving [B] elements. Also, one can eliminate the additional [B] terms in the calculation by making the device conductance at s0 equal to 0 S. Here, a non-zero conductance was chosen to alleviate the high device variability that RRAM devices exhibit close to the high-resistance state (HRS). Figure 5(b) shows the Modified National Institute of Standards and Technology database (MNIST) classification accuracy for different encoding schemes for a 3-layer DNN, i.e. a “subset” of the 4-layer DNN 500 depicted in Figure 5(a), with the simplification outlined above. Considering the OE, system power, and 3-layer DNN accuracy, the S1_4_3 encoding scheme was chosen for further evaluations, according to an example embodiment. Using the above encoding scheme, the classification accuracy for the MNIST database was evaluated using the Python-MATLAB interface developed. From Figure 5(c) one observes that the classification accuracy of the scheme for different CNNs (a 3-layer DNN and a 4-layer DNN) according to an example embodiment is comparable to a software implementation.
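
As a generic illustration of how signed kernel elements can be handled with all-positive conductances (a simplified stand-in for the positive/negative separation into [M1/2] and [M3/4] described above, not the embodiment's transformation equations), the sketch below splits a signed kernel into positive and negative parts, convolves each separately, and recovers the signed result by subtraction.

    import numpy as np
    from scipy.signal import convolve2d

    rng = np.random.default_rng(2)
    A = rng.normal(0.0, 0.3, size=(9, 9))         # signed, roughly Gaussian kernel weights
    B = rng.uniform(0.0, 1.0, size=(28, 28))      # ReLU-style non-negative feature map

    # Generic decomposition into all-positive parts (illustrative only):
    A_pos = np.maximum(A, 0.0)                    # positive kernel elements
    A_neg = np.maximum(-A, 0.0)                   # magnitudes of negative kernel elements

    # Each part can be mapped to non-negative conductances and convolved separately;
    # the signed result is recovered by a subtraction in the digital domain.
    out = convolve2d(B, A_pos, mode="valid") - convolve2d(B, A_neg, mode="valid")
    ref = convolve2d(B, A, mode="valid")
    assert np.allclose(out, ref)                  # identical to the signed convolution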

Pipelined-accelerator design

To understand the effect of using the staircase array according to an example embodiment on accelerator power/area, the system parameters per array were evaluated as a function of outputs/AS and the number of AS forming each array (#AS). The S1_4_3 scheme was considered for this analysis, and the ADC resolutions were derived from Figure 4(d) based on the contributing RRAMs. In addition to the various analog and A/D interface circuits, the various digital components (multipliers, adders, input registers, output registers) required for processing data within these arrays according to an example embodiment were also considered. Multiple arrays according to an example embodiment are assumed to share the available ADCs, to enable the complete utilization of the various digital components. The ADC outputs are fed into the adders, the results of which are supplied to the multipliers. Any residual additions/subtractions are assumed to be executed in the tile top and are not considered for this analysis. The power and area of the individual components are as given in Tables 1 and 2, respectively. Figures 6(a) and (c) delineate that an increase in the outputs/AS and the #AS results in a steady decrease in power, according to example embodiments. This decrease is owing to the increased utilization of available resources and plateaus after reaching a threshold value (PP). From Figure 6(b), one observes an initial dip in the area followed by an exponential rise with an increase in outputs. Initially, for low input counts, the routing between consecutive AS remains constant while the sub-array area increases. However, this increase is smaller than the area reduction due to increased DAC sharing. Beyond outputs = Kernel_rows+1, any increase in outputs leads to an increased track requirement between consecutive AS and sub-arrays. Such an increase in track requirement leads to an exponential rise in the RRAM area, which subsequently becomes the dominant factor. As an increase in #AS does not increase the routing/track requirements while increasing resource sharing, one observes a steady decline in the area with an increase in #AS in Figure 6(d), according to example embodiments.

Furthermore, the performance of the system according to an example embodiment was compared with the staggered-3D array and the Manhattan layout, as a function of kernel size for the S1_4_3 encoding scheme, in Figures 6(e) and (f). For a 28x28 input, the power and area consumed for parallel convolution output generation were compared for the different layouts and kernel sizes. 64 kernel sets operating on the same images were considered to allow for the full utilization of the Manhattan array; the array size and ADC resolution differ between layouts and are determined based on the kernel (Figure 4(d)). For the Manhattan layout, 3x3 kernels are processed on arrays of size 18x64, 5x5 on 50x64, 7x7 on 49x64, and 9x9 on 64x64. Owing to I-R drop issues, the size of the Manhattan array was capped at 64x64 (~8% degradation). 9x9 kernels on arrays of size 10x20 (10 outputs/AS, 20 AS), 7x7 on 22x22, 5x5 on 24x24, and 3x3 on 26x26 are processed for the planar-staircase layout according to an example embodiment. For the staggered-3D version, no increase in the I-R drop is observed irrespective of the inputs and outputs, and hence a 256x256 array was considered (Figure 3(e)), with a varying number of RRAM layers (capped at 9). The RRAMs processing the ceil and floor state matrix elements feed into the same integrator circuit in the staggered-3D layout. The power and area of the various memory controller units are documented in Table 1. In Figures 6(e) and (f), MH_1K corresponds to the parameters for the Manhattan array processing a single kernel, while MH_64K is for the processing of 64 kernels. Since the Manhattan array parameters depend on the number of kernels, the worst and best cases are presented.

For the staggered-3D array, the lower ADC resolution and input regeneration result in the lowest power/area consumption among the considered layouts for a 3x3 kernel. But an increase in contributing RRAMs with kernel size increases the ADC resolution and the number of ADC accesses. Due to this, power consumption is higher for staggered-3D arrays for larger kernels. Though the RRAM footprint is lower with the 3D system, the peripheral requirement is higher (a maximum of 9 contributing RRAMs per output, as shown in Figure 3(e)), and one observes higher savings with the other layouts for large kernels. Multiple 5x5 kernels and the ceil/floor matrices can be simultaneously processed using a single array for the Manhattan layout. Such complete utilization lowers input regeneration and ADC usage to reduce power/area consumption compared to other structures for this case. But with an increase in kernel size, the kernel needs to be partitioned into multiple parts for processing using the Manhattan arrays. Such a split increases the ADC accesses and input regeneration, leading to increased power and area requirements. For a kernel size of 9x9, one observes area savings of ~73% and a power reduction of ~68% by the planar-staircase layout according to an example embodiment over the MH_1K case, while also achieving significant savings over the MH_64K execution.

In addition, the convolution of multiple kernels with the same input image can be executed using a single planar-staircase array according to an example embodiment by storing the elements of different filters in different AS. Thus, the outputs of an individual AS belong to the same kernel, while disparate AS outputs pertain to distinct kernels. Such execution requires rotating each kernel's columns across the sub-arrays of the AS according to an example embodiment, based on the location of the inputs applied. Furthermore, when outputs/AS > Kernel_rows+1, input lines are shared between adjacent AS alone, according to an example embodiment. Therefore, one can process kernels acting on multiple inputs, independent of whether they contribute to the same output, by disregarding an AS in the middle, thereby separating the inputs. Using this, one can process [M3] and [M4] of numerous images using a single array to reduce the area and power requirement, according to an example embodiment. Such flexible processing enables complete utilization of the planar-staircase arrays according to an example embodiment and is not possible using the Manhattan layout.

Using the results from the previous analyses, the area and power efficiencies of the pipelined accelerator were evaluated for different configurations. The performance of the accelerator shown in Figure 1(d) according to example embodiments depends on factors such as the number of IMs per tile (I), the number of individual arrays per IM (C), the number of available ADCs in an IM (A), the number of AS per array (AS), and the total outputs (O) per array. As the ADCs and eDRAM contribute most to the accelerator power and area, it is preferred to optimize their requirement while enabling higher throughput. Based on the benchmarks, the size of the eDRAM buffer in a tile was established to be 64 KB. The outputs of the previous layer are stored in the current layer's eDRAM buffer. When the new inputs necessary for processing the kernels in this layer arrive, the current layer proceeds with its operations.
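
For readability of the configuration labels used in the following (e.g. O120_AS12_A8_I8_C8), the small Python helper below decodes such a label into the parameters defined above; the helper itself is illustrative only and not part of the described system.

    import re

    def decode_config(label: str) -> dict:
        """Decode an accelerator configuration label such as 'O120_AS12_A8_I8_C8'.

        O  - total outputs per array
        AS - number of AS per array
        A  - number of available ADCs in an IM
        I  - number of IMs per tile
        C  - number of individual arrays per IM
        """
        return {key: int(value) for key, value in re.findall(r"([A-Z]+)(\d+)", label)}

    print(decode_config("O120_AS12_A8_I8_C8"))
    # -> {'O': 120, 'AS': 12, 'A': 8, 'I': 8, 'C': 8}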

In the first cycle of the operation, the 16-bit inputs stored in the eDRAM are read out and sent to the PU for state-matrix determination. The eDRAM and shared bus were designed to support this maximum bandwidth. A PU consists of a sorting unit to determine the peak value, multipliers for fast division, followed by comparators and combinatorial circuits. The state-matrix elements are sent over the shared bus to the current layer’s IM and stored in the input register (IR). The IR width was determined based on the unique inputs to an array and the number of arrays in each IM. While the number of DACs required by each array is (x+r-1)×(n+n-1+(0.5×(n-1)×(n-1))), the number of unique inputs to each array is (x+r-1)×(n+n-1). The variable definitions remain unchanged from the array size evaluation section described above. The transfer of data from the eDRAM to the IR is performed within a 100 ns stage. After this, the IM sends the data to the respective arrays and performs the in-memory computing during the next cycle. At the end of the 100 ns computation cycle, the outputs are latched in the SA circuits. In the next cycle, the ADCs convert these outputs to their 8-bit digital equivalents. The results of the ADCs are merged by the adder units (A), after which they are multiplied with the quantization step using 16-bit multipliers, together indicated as "A+M" in Figure 1(d), and stored in the output register (OR) of the IM. In the 5th cycle, the final output stored in the OR is sent to the central OR units in the tile. These values may undergo another step of addition and merging with the central OR in the tile if the convolution is spread across multiple IMs. The contents of the central OR are sent to the ReLU unit (RU) in cycle 6. The ReLU unit consists of simple comparators that incur a relatively small area and power penalty. After processing the ReLU outputs using the max-pool unit (MP) in cycle 7, the output feature map elements are written into the eDRAM of the next layer in cycle 8. The mapping of layers to different tiles and IMs, and the resulting pipeline, are determined off-line and loaded into control registers that drive finite state machines. For non-Gaussian distributions with a non-zero conductance s_min at the high-resistance state, additional multipliers and adders are included in dedicated IMs processing [M3] and [M4] elements. These circuits calculate the residual value given in (10) within the IM while the in-memory convolution is being executed. The residual values are added to the array outputs in subsequent cycles without disturbing the pipeline.
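
The cycle-by-cycle operation described above can be summarized as the following stage list; the listing is purely descriptive, with the stage contents and the 100 ns stage time taken from the description above, while the Python data structure itself is illustrative.

    # Descriptive summary of the 100 ns pipeline stages described above (illustrative only).
    PIPELINE_STAGES = (
        (1, "eDRAM read, state-matrix determination in the PU, transfer to the IM input register"),
        (2, "in-memory computation in the arrays; outputs latched in the SA circuits"),
        (3, "ADC conversion of the latched outputs to 8-bit digital values"),
        (4, "adder merge, 16-bit multiplication by the quantization step, store in the IM output register"),
        (5, "transfer of the IM output register contents to the central OR units in the tile"),
        (6, "central OR contents sent to the ReLU unit (RU)"),
        (7, "max-pool unit (MP) processes the ReLU outputs"),
        (8, "output feature-map elements written into the next layer's eDRAM"),
    )

    STAGE_TIME_NS = 100
    print(f"pipeline depth: {len(PIPELINE_STAGES)} stages, "
          f"latency: {len(PIPELINE_STAGES) * STAGE_TIME_NS} ns")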

Furthermore, to deal with both the convolution layers and the fully connected layers, the accelerator according to an example embodiment is divided into an equal number of Manhattan-array tiles and planar-staircase-array tiles. It is noted that the staircase tiles are expected to be optimally used only for the execution of convolution operations. Since any CNN consists of both convolution and fully connected layers (compare Figure 5(a)), both planar-staircase arrays and Manhattan arrays were used according to an example embodiment for best results. For the accelerator design, planar-staircase arrays with 81 contributing RRAMs per output according to an example embodiment and Manhattan arrays of size 64x64 were considered. The digital overheads of the different tiles are made equal by choosing the appropriate number of arrays per IM based on the array type. The area and power usage was estimated from the full layout of the system at the 40nm node, including all peripheral and routing circuits needed to perform all operations. Power and area estimates for the determined optimum performance of the accelerator according to an example embodiment at the O120_AS12_I8_C8 (planar-staircase tiles) configuration are provided in Table 2.

TABLE 2

Power and Area Estimates

Component              Value
Tech. node             40 nm
Outputs                120
Unique Inputs          360
#Operations/Array      19440
#RRAM Devices          81x120
RRAM power             0.16 mW
RRAM area              45.489 µm²
DAC resolution         1-bit
#DAC accesses          1152
DAC power              1.845 mW
DAC area               0.0001912 mm²
ADC resolution         8-bit
#ADC accesses          1
ADC power               2 mW
ADC area               0.0012 mm²
SA accesses            120
SA Power               9.31 µW
SA Area                4.6875 µm²
LPU + Routing Power    2.321 mW
LPU + Routing Area     0.008516 mm²
Frequency              10 MHz
Total Power            6.3353 mW
Total Area             0.0099574 mm²

Parameters per array for the O120_AS12_A8_I8_C8 configuration. Each chip consists of 84 such tiles.
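
As an indicative cross-check of the Table 2 figures, the Python snippet below tallies the dominant per-array analog/mixed-signal contributions from the per-component values of Table 1 and the access counts of Table 2. The breakdown is only approximate, since the LPU and routing power are accounted for separately and the digital components are shared across arrays.

    # Indicative per-array tally using the component values of Table 1 and the access
    # counts of Table 2 (a sanity check only, not the full accelerator power model).
    rram_power = 81 * 120 * 16.34e-9      # 9720 devices at 16.34 nW  -> ~0.16 mW
    dac_power  = 1152 * 1.61e-6           # 1152 1-bit DAC accesses   -> ~1.85 mW
    adc_power  = 1 * 2e-3                 # one shared 8-bit ADC      -> 2 mW
    sa_power   = 120 * 77.5e-9            # 120 sense amplifiers      -> ~9.3 uW

    subtotal = rram_power + dac_power + adc_power + sa_power
    print(f"RRAM {rram_power * 1e3:.3f} mW, DAC {dac_power * 1e3:.3f} mW, "
          f"ADC {adc_power * 1e3:.3f} mW, SA {sa_power * 1e6:.1f} uW")
    print(f"analog/mixed-signal subtotal ~{subtotal * 1e3:.2f} mW; "
          f"adding the 2.321 mW LPU + routing power approaches the 6.3353 mW total")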

It is noted that the power-efficiency of the technique according to an example embodiment can be further improved by efficient complementary metal-oxide semiconductor (CMOS) routing techniques. Also, while the above-described optimizations focus on the layout of RRAM arrays and M2M execution within them, using an example embodiment in conjunction with other system-level optimizations, such as buffer-size reduction and CMOS routing optimization, could achieve higher area-efficiency & power-efficiency.

In an example embodiment, a planar-staircase array with Al2O3 RRAM devices has been described. By applying voltage pulses to the staircase-routed array's bottom electrodes for convolution execution, a concurrent shift in inputs is generated according to an example embodiment to eliminate matrix unfolding and regeneration. This results in a ~73% area and ~68% power reduction for a kernel size of 9x9, according to an example embodiment. The in-memory compute method described according to an example embodiment increases output accuracy, efficiently tackles device issues, and achieves 99.2% MNIST classification accuracy with a 4-bit kernel resolution and 3-bit input feature map resolution, according to an example embodiment. Variation-tolerant M2M according to an example embodiment is capable of processing signed matrix elements for kernels and the input feature map as well, within a single array, to reduce area overheads. Using the co-designed system, peak power and area efficiencies of 14.14 TOPs W⁻¹ and 8.995 TOPs mm⁻² were shown, respectively. Compared to state-of-the-art accelerators, an example embodiment improves power efficiency by 5.64x and area efficiency by 4.7x.

Embodiments of the present invention can have one or more of the following features and associated benefits/advantages:

Low-complexity, low-power staggered layout of the crossbar:

The bottom electrode of the proposed 2D array is routed in a staggered fashion. Such a layout can efficiently execute convolutions between two matrices while eliminating input regeneration and unfolding. This, in turn, improves throughput while reducing power, area and redundancy. In addition, fabrication of a staggered 2D array is considerably simpler than 3D array fabrication.

Pulse application at bottom electrode:

Inputs are applied at the bottom electrodes of the device, and the output current is collected from the top electrodes. By using the top electrodes for device programming and the bottom electrodes for data processing, both the programming time and the processing time can be reduced.

Low-complexity mapping of kernel values to RRAM conductance:

Current in-memory methods use complex algorithms to map kernel values to RRAM resistances in multiple arrays for parallel output generation. In an example embodiment, the mapping methodology is considerably simpler and reduces pre-processing time.

High throughput while maintaining low-power and low-area:

Compared to current state-of-the-art accelerators using GPUs, ASIC-based systems and RRAM-based systems, a co-designed system according to an example embodiment shows higher throughput while using less power and area. This is owing to the reduction in input regeneration and unfolding, which in turn reduces the peripheral circuit requirement.

Scalability and ease of integration with other emerging memories:

A co-designed system according to an example embodiment can be scaled based on application requirements and can be integrated with other emerging memories such as Phase-Change Memories (PCMs), Oxide-RRAMs (Ox-RRAMs), etc.

In one embodiment, a memory device for deep neural network, DNN, accelerators is provided, the memory device comprising: a first electrode layer comprising a plurality of bit-lines; a second electrode layer comprising a plurality of word-lines; and an array of memory elements disposed at respective cross-points between the plurality of word-lines and the plurality of bit-lines; wherein at least a portion of the bit-lines are staggered such that a location of a cross-point between the bit-line and a first word-line is displaced along a direction of the word-lines compared to a cross-point between said bit-line and a second word-line adjacent the first word-line; or wherein at least a portion of the word-lines are staggered such that a location of a cross-point between the word-line and a first bit-line is displaced along a direction of the bit-lines compared to a cross-point between said word-line and a second bit-line adjacent the first bit-line.

Where at least a portion of the bit-lines are staggered, the array of memory elements may comprise a plurality of array-structures, ASs, each AS comprising a set of adjacent word-lines, wherein each AS comprises a plurality of sub-arrays, wherein each sub-array is configured to take inputs from a row of an input matrix and to have the elements of a row of a kernel applied in the DNN accelerator contributing to the output.

The memory device may be configured to have a digital to analog converter, DAC, circuit coupled to the bit-lines for inference processing. The memory device may comprise a connection layer separate from the first and second electrode layers for connecting intermediate bit-line inputs disposed between adjacent ones of the word-lines to the DAC circuit for inference processing.

The memory device may be configured to have an analog to digital converter and sense amplifier, ADC/SA, circuit coupled to the word-lines for inference processing.

Where at least a portion of the word-lines are staggered, the array of memory elements may comprise a plurality of array-structures, ASs, each AS comprising a set of adjacent bit-lines, wherein each AS comprises a plurality of sub-arrays, wherein each sub-array is configured to take inputs from a row of an input matrix and to have the elements of a row of a kernel applied in the DNN accelerator contributing to the output. The memory device may be configured to have a digital to analog converter, DAC, circuit coupled to the word-lines for inference processing. The memory device may comprise a connection layer separate from the first and second electrode layers for connecting intermediate word-line inputs disposed between adjacent ones of the bit-lines to the DAC circuit for inference processing.

The memory device may be configured to have an analog to digital converter and sense amplifier, ADC/SA, circuit coupled to the bit-lines for inference processing.

Each memory element may comprise a switching layer sandwiched between the bottom and top electrode layers. The switching layer may comprise Al2O3, SiO2, HfO2, MoS2, TaOx, TiO2, ZrO2, ZnO, GeSbTe, Cu-GeSex, etc.

At least one of the bottom and top electrode layers may comprise an inert metal such as Platinum, Palladium, Gold, Silver, Copper, Tungsten etc.

At least one of the bottom and top electrode layers may comprise a reactive metal such as Titanium, TiN, TaN, Tantalum etc.

Figure 7 shows a flowchart 700 illustrating a method of fabricating a memory device for deep neural network, DNN, accelerators, according to an example embodiment.

At step 702, a first electrode layer comprising a plurality of bit-lines is formed.

At step 704, a second electrode layer comprising a plurality of word-lines is formed.

At step 706, an array of memory elements disposed at respective cross-points between the plurality of word-lines and the plurality of bit-lines is formed, wherein at least a portion of the bit-lines are staggered such that a location of a first cross-point between the bit-line and a first word-line is displaced along a direction of the word-lines compared to the cross-point between said bit-line and a second word-line adjacent the first word-line; or wherein at least a portion of the word-lines are staggered such that a location of a cross-point between the word-line and a first bit-line is displaced along a direction of the bit-lines compared to a cross-point between said word-line and a second bit-line adjacent the first bit-line.

Where at least a portion of the bit-lines are staggered, the array of memory elements may comprise a plurality of array-structures, ASs, each AS comprising a set of adjacent word-lines, wherein each AS comprises a plurality of sub-arrays, wherein each sub-array is configured to take inputs from a row of an input matrix and to have the elements of a row of a kernel applied in the DNN accelerator contributing to the output.

The method may comprise configuring the memory device to have a digital to analog converter, DAC, circuit coupled to the bit-lines during inference processing. The method may comprise forming a connection layer separate from the first and second electrode layers for connecting intermediate bit-line inputs disposed between adjacent ones of the word-lines to the DAC circuit during inference processing.

The method may comprise configuring the memory device to have an analog to digital converter and sense amplifier, ADC/SA, circuit coupled to the word-lines during inference processing.

Where at least a portion of the word-lines are staggered, the array of memory elements may comprise a plurality of array-structures, ASs, each AS comprising a set of adjacent bit-lines, wherein each AS comprises a plurality of sub-arrays, wherein each sub-array is configured to take inputs from a row of an input matrix and to have the elements of a row of a kernel applied in the DNN accelerator contributing to the output.

The method may comprise configuring the memory device to have a digital to analog converter, DAC, circuit coupled to the word-lines during inference processing. The method may comprise forming a connection layer separate from the first and second electrode layers for connecting intermediate word-line inputs disposed between adjacent ones of the bit-lines to the DAC circuit during inference processing.

The method may comprise configuring the memory device to have an analog to digital converter and sense amplifier, ADC/SA, circuit coupled to the bit-lines during inference processing.

Each memory element may comprise a switching layer sandwiched between the bottom and top electrode layers. The switching layer may comprise Al2O3, SiO2, HfO2, MoS2, TaOx, TiO2, ZrO2, ZnO, GeSbTe, Cu-GeSex, etc.

At least one of the bottom and top electrode layers may comprise an inert metal such as Platinum, Palladium, Gold, Silver, Copper, Tungsten etc.

At least one of the bottom and top electrode layers may comprise a reactive metal such as Titanium, TiN, TaN, Tantalum etc.

Figure 8 shows a flowchart 800 illustrating a method of convoluting a kernel [A] with an input feature map [B] in a memory device for a deep neural network, DNN, accelerator, according to an example embodiment.

At step 802, the kernel is transformed using

At step 804, the feature map is transformed using

At step 806, [A1] is split using

At step 808, [U1] is split using

At step 810, a state transformation is performed on [M1], [M2], [M3], and [M4] to generate memory device conductance state matrices to be used to program memory elements of the memory device.

At step 812, [B1] and [U2] are used to determine the respective pulse-width matrices to be applied to the word-lines/bit-lines of the memory device.

Performing a state transformation on [M1], [M2], [M3], and [M4] to generate the memory device conductance state matrices may be based on a selected quantization step of the DNN accelerator. Using [B1] and [U2] to determine the respective pulse-width matrices may be based on the selected quantization step of the DNN accelerator.

The method may comprise splitting each of [M1] and [M2] using equivalent equations, and performing a state transformation on the resultant split matrices to generate additional memory device conductance state matrices to be used to program memory elements of the memory device, for increasing an accuracy of the DNN accelerator.

In one embodiment, a memory device for a deep neural network, DNN, accelerator is provided, configured for executing the method of convoluting a kernel [A] with an input feature map [B] in a memory device for a deep neural network, DNN, accelerator according to any one of the above embodiments.

In one embodiment, a deep neural network, DNN, accelerator is provided, comprising a memory device according to any one of the above embodiments.

Aspects of the systems and methods described herein may be implemented as functionality programmed into any of a variety of circuitry, including programmable logic devices (PLDs), such as field programmable gate arrays (FPGAs), programmable array logic (PAL) devices, electrically programmable logic and memory devices and standard cell-based devices, as well as application specific integrated circuits (ASICs). Some other possibilities for implementing aspects of the system include: microcontrollers with memory (such as electronically erasable programmable read only memory (EEPROM)), embedded microprocessors, firmware, software, etc. Furthermore, aspects of the system may be embodied in microprocessors having software-based circuit emulation, discrete logic (sequential and combinatorial), custom devices, fuzzy (neural) logic, quantum devices, and hybrids of any of the above device types. Of course, the underlying device technologies may be provided in a variety of component types, e.g., metal-oxide semiconductor field-effect transistor (MOSFET) technologies like complementary metal-oxide semiconductor (CMOS), bipolar technologies like emitter-coupled logic (ECL), polymer technologies (e.g., silicon-conjugated polymer and metal-conjugated polymer-metal structures), mixed analog and digital, etc. The various functions or processes disclosed herein may be described as data and/or instructions embodied in various computer-readable media, in terms of their behavioral, register transfer, logic component, transistor, layout geometries, and/or other characteristics. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, non-volatile storage media in various forms (e.g., optical, magnetic or semiconductor storage media) and carrier waves that may be used to transfer such formatted data and/or instructions through wireless, optical, or wired signaling media or any combination thereof. When received into any of a variety of circuitry (e.g. a computer), such data and/or instructions may be processed by a processing entity (e.g., one or more processors).

The above description of illustrated embodiments of the systems and methods is not intended to be exhaustive or to limit the systems and methods to the precise forms disclosed. While specific embodiments of, and examples for, the systems components and methods are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the systems, components and methods, as those skilled in the relevant art will recognize. The teachings of the systems and methods provided herein can be applied to other processing systems and methods, not only for the systems and methods described above.

It will be appreciated by a person skilled in the art that numerous variations and/or modifications may be made to the present invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects to be illustrative and not restrictive. Also, the invention includes any combination of features described for different embodiments, including in the summary section, even if the feature or combination of features is not explicitly specified in the claims or the detailed description of the present embodiments.

In general, in the following claims, the terms used should not be construed to limit the systems and methods to the specific embodiments disclosed in the specification and the claims, but should be construed to include all processing systems that operate under the claims. Accordingly, the systems and methods are not limited by the disclosure, but instead the scope of the systems and methods is to be determined entirely by the claims.

Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise," "comprising," and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in a sense of "including, but not limited to." Words using the singular or plural number also include the plural or singular number respectively. Additionally, the words "herein," "hereunder," "above," "below," and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word "or" is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list and any combination of the items in the list.