

Title:
A METHOD TO ACCELERATE DEEP LEARNING APPLICATIONS FOR EMBEDDED ENVIRONMENTS
Document Type and Number:
WIPO Patent Application WO/2023/033759
Kind Code:
A1
Abstract:
The present disclosure relates to the field of computation and covers methods to achieve faster processing speed for deep learning applications, especially in embedded environments. The present disclosure specifically relates to a sub-class of deep learning applications called convolutional neural networks; it is specifically designed to increase processing speed in embedded environments where energy efficiency requirements and the lack of fast co-processors (such as GPUs) may necessitate processing to be carried out on the central processing unit (CPU).

Inventors:
OKUYAN ERKAN (TR)
Application Number:
PCT/TR2022/050711
Publication Date:
March 09, 2023
Filing Date:
July 06, 2022
Assignee:
ASELSAN ELEKTRONIK SANAYI VE TICARET ANONIM SIRKETI (TR)
International Classes:
G06N3/04; G06F12/0862; G06F12/0888; G06F12/0897; G06N3/063
Foreign References:
EP3816867A12021-05-05
Other References:
VIEIRA JOAO ET AL: "Processing Convolutional Neural Networks on Cache", ICASSP 2020 - 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 4 May 2020 (2020-05-04), pages 1658 - 1662, XP033793998, DOI: 10.1109/ICASSP40776.2020.9054326
Attorney, Agent or Firm:
DESTEK PATENT, INC. (TR)
Claims:
CLAIMS

1. A method to accelerate deep learning applications for embedded applications comprising the steps of:

• dividing rows of an image into consecutive row groups,

• selecting a group of calculation,

• selecting a column of calculation within the group,

• selecting a row of calculation within the group and the current column,

• calculating the pixel belonging to the selected group, column and row,

o proceeding to the selecting a row step, if all rows within the selected group and column are not processed,

o proceeding to the selecting a column step, if all rows within the selected group and column are processed,

o proceeding to the selecting a group step, if all columns within the selected group are processed,

o ending calculation if all groups are processed.

2. The method to accelerate deep learning applications for embedded applications according to Claim 1, wherein row ordered and column ordered modes of execution are user configurable.

3. The method to accelerate deep learning applications for embedded applications according to Claim 1, wherein group size calculation is carried out using the formula s = round( sqrt(size(Lx)) ), where size(Lx) corresponds to the size, in pixels, of the cache level to which the optimization is directed.

4. The method to accelerate deep learning applications for embedded applications according to Claim 1 or Claim 3, wherein the cache level to which the optimization is directed is determined by the size of the image to process.

5. The method to accelerate deep learning applications for embedded applications according to Claim 1 or Claim 3, wherein the cache level to which the optimization is directed is user controlled and experimentally determined.

6. The method to accelerate deep learning applications for embedded applications according to Claim 1 or Claim 3, wherein, if information about application specific data is available, the formula to calculate group size is modified to s = round( sqrt(size(Lx) - size(occupied)) ), where size(Lx) corresponds to the size, in pixels, of the cache level to which the optimization is directed, and size(occupied) is the portion of the cache made unavailable by application specific data.

Description:
A METHOD TO ACCELERATE DEEP LEARNING APPLICATIONS FOR EMBEDDED ENVIRONMENTS

Field of the Invention

The present disclosure relates to the field of computation and covers methods to achieve faster processing speed for deep learning applications especially for embedded environments.

The present disclosure specifically relates to a sub-class of deep learning applications called convolutional neural networks; it is specifically designed to increase processing speed in embedded environments where energy efficiency requirements and the lack of fast co-processors (such as GPUs) may necessitate processing to be carried out on the central processing unit (CPU).

Background of the Invention

It is known that there are methods which employ the deep learning paradigm, and such methods are used to solve many different kinds of critical problems. The versatility and success of such methods have made deep learning a very hot topic and a main area of research and development. The success and importance of deep learning can be demonstrated by pointing out that proposed methods have dramatically improved the state of the art in many areas such as visual object recognition, object detection and speech recognition, as well as other domains such as bioinformatics.

Artificial neural networks are inspired by biological systems and mimic the computations carried out by neurons in biological structures (such as the brain). Artificial neural networks consist of a collection of units called neurons (artificial and simulated in this case). There are connections between pairs of neurons which are used to transmit signals to other neurons. A neuron receiving signals (from multiple sources) processes the information gathered (and possibly its internal state) and uses the processing result to signal the downstream neurons. In artificial neural networks, neurons are ordered in a layered manner and the outputs of the last layer of neurons are intended to produce a solution to a specific task. A deep neural network is a special kind of artificial neural network in which there are multiple layers between the input layer and the output layer (hence the name deep). Deep neural networks have the ability to model complex non-linear relationships.

Convolutional neural networks are a special class of deep neural networks, mainly used to process visual data (such as images and videos). Fully connected neural networks are networks in which a neuron in a layer is connected to all the neurons in the next layer. These kinds of networks are prone to overfitting. Convolutional neural networks, in contrast, are regularized versions of fully connected networks. This regularization is carried out by utilizing the hierarchical pattern in data (such as spatial locality in an image). Thus, convolutional neural networks have the ability to extract complex patterns using smaller and simpler patterns.

Deep learning, albeit a very powerful approach, comes with a very heavy computational load. However, embedded systems generally need to be very energy efficient and employ processors that are computationally inferior to many contemporary processing units. Furthermore, many convolutional neural network approaches use GPGPU (general purpose computing on graphics processing units) to accelerate the application at hand. Embedded systems, in order to be energy efficient, generally do not include GPUs (graphics processing units) for such usage. Thus, the present invention is mainly designed for embedded environments where a CPU is used to carry out execution over the test dataset.

Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection", discloses a single-stage object detection method. It proposes a straightforward architecture that divides the image into grids and detects objects in those grids. Due to its straightforward architecture, real-time applications can be created by utilizing this method (albeit with the help of a GPU-like architecture). However, due to the single-stage nature of the method, its accuracy is inferior with respect to two-stage methods.

The document named "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" (Ren et al., 2015) discloses a two-stage object detection method. During the first stage of the method, a region proposal network investigates the objectness of a region to find out whether there is an object in the region. During the second stage of the method, if an object region is proposed, a network with two outputs is constructed by the detector. The first output is the classifier output and the second output is the bounding box of the object. Comparatively, this method is more accurate than one-stage methods. However, the two-stage nature of the method results in slower execution times (even with GPGPU help, only interactive rates are achieved).

The document "Fully-Convolutional Siamese Networks for Object Tracking" (Bertinetto et al., 2016) discloses a network that uses the same weights while working on two different input vectors. The aim of this method is to produce a map of similarity scores between the input image and the search image. Using the similarity measurement map, the object is distinguished from the background/clutter. Due to the simple network architecture, the method works in real time if GPU-like co-processors are utilized.

The application numbered CN110766133A discloses a data processing method and device in embedded equipment, equipment and a storage medium. The data processing method comprises the steps of obtaining an input matrix corresponding to input data; inputting the input matrix into a first intermediate network layer of a deep learning model; obtaining a plurality of block matrices; obtaining a weight matrix and an offset matrix; obtaining a first output matrix of the current intermediate network layer; taking the first output matrix as an input matrix of the next intermediate network layer, and executing the blocking operation until a target output matrix output by the last intermediate network layer is obtained; and obtaining target output data.

All the methods discussed here use the convolution operation as a sub-function of the network. However, while the methods discussed define the state of the art in the convolutional neural network literature, none of them proposes an efficient way to process the inputs without the help of co-processors such as GPUs. There are many other satisfactory methods proposed in the computer vision domain focusing on convolutional neural network approaches. Many of these approaches are computationally demanding. Thus, many of these methods run on graphics processing units to satisfy this demand. However, many environments (especially embedded ones) cannot satisfy this demand, either because of a lack of hardware or because of power concerns. Thus, an efficient method to increase the efficiency of said methods in such environments is needed.

The present invention proposes a method to accelerate a core function (convolution) that is used by state-of-the-art convolutional neural network methods and works on a CPU-based system. This is achieved by efficient usage of the cache hierarchy in the system.

Summary of the Invention

An objective of the present invention is to achieve faster processing speed, for a sub-class of deep learning applications called convolutional neural networks, especially in embedded environments. The present invention is specifically designed to increase processing speed in embedded environments where energy efficiency requirements and the lack of fast co-processors (such as GPUs) may necessitate processing to be carried out on the central processing unit (CPU).

Brief Description of the Figures

A system and method realized to fulfil the objective of the present invention is illustrated in the accompanying figures, in which:

Figure 1 shows an example cache hierarchy.

Figure 2 shows an example of a commonly used processing order (row ordered).

Figure 3 shows an example of the proposed processing order (row ordered).

Figure 4 shows a flowchart of the processing order shown in Figure 2.

Figure 5 shows a flowchart of the processing order shown in Figure 3 (proposed method).

Figure 6 shows an example of a commonly used processing order (column ordered).

Figure 7 shows an example of the proposed processing order (column ordered).

Detailed Description of the Invention

There is an inherent sub-function that is crucial for the fast execution of convolutional neural networks. This is the convolution operation, and it can be thought of as passing a filter over an image. Table 1 shows a filtering operation for a 3x3 filter.

Table 1: An Example Image and Sample Convolutions Over the Image

Table 1 shows a sample image. In this table, column and row numbers are clearly shown and each pixel is represented by its row position and its column position. In this example, two 3x3 filter outputs are calculated. The first calculation is carried out with a total of nine pixels: {Pixel(1,1), Pixel(1,2), Pixel(1,3), Pixel(2,1), Pixel(2,2), Pixel(2,3), Pixel(3,1), Pixel(3,2), Pixel(3,3)}. The second calculation is carried out with the pixels: {Pixel(2,2), Pixel(2,3), Pixel(2,4), Pixel(3,2), Pixel(3,3), Pixel(3,4), Pixel(4,2), Pixel(4,3), Pixel(4,4)}. There is a function F that maps these 9 input pixels to an output pixel (which corresponds to an element in the output image), and the F function has weights, a bias and a threshold (and possibly other properties). The weights, bias and threshold are what make the F function behave like a neuron. Ultimately the F function produces an output value from its inputs (9 pixel values in this example), weights, bias and threshold; this value corresponds to a pixel in the output image. This image and the correspondence are shown in Table 2, where the two calculated 3x3 filter outputs corresponding to the ones shown in Table 1 are presented.

Table 2: An Example Processed Image Produced from the Image Shown in Table 1 via a 3x3 Convolution Operation Using Function F
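As an illustration only, a minimal sketch of such an F function for a 3x3 neighborhood might look as follows. The float pixel type, the particular handling of the weights, bias and threshold, and the ReLU-style thresholding are assumptions made for the sake of the example; the patent does not fix them.

```c
/* Sketch of the neuron-like function F for a 3x3 neighborhood: a weighted
 * sum of the nine input pixels plus a bias, passed through a threshold.
 * All parameter values are illustrative, not prescribed by the method. */
static float F(const float in[3][3], const float weight[3][3],
               float bias, float threshold)
{
    float sum = bias;
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++)
            sum += weight[r][c] * in[r][c];
    return (sum > threshold) ? sum : 0.0f;  /* one possible thresholding */
}
```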

The pixel located at (Row 2, Column 2) of the output image (see Table 2) is a function of 9 pixels of the input image. These nine pixels are Pixel(2,2) of the input image and its 8 neighboring pixels (see Table 1). To see the relation visually, the input and output pixels are shaded the same. Another example can be given for (Row 3, Column 3), and this relation is shown with a different shading (again see Table 1). As can be seen from the example, there is an overlap between the pixels used while calculating (Row 2, Column 2) and (Row 3, Column 3). These pixels are Pixel(2,2), Pixel(2,3), Pixel(3,2) and Pixel(3,3). The normal process of a CPU for a calculation is the retrieval of data from memory and its use for the calculation. To speed up the process, the most recently retrieved data is stored in the cache for fast retrieval. Thus, it is possible to find an order of pixels for processing which maximizes the overlaps of regions (such as shown in the example) to decrease execution times. For example, after processing pixel (Row 2, Column 2) and retrieving data for the set {Pixel(1,1), Pixel(1,2), Pixel(1,3), Pixel(2,1), Pixel(2,2), Pixel(2,3), Pixel(3,1), Pixel(3,2), Pixel(3,3)}, the data is stored in the cache. Then, if we want to process pixel (Row 3, Column 3) using the set {Pixel(2,2), Pixel(2,3), Pixel(2,4), Pixel(3,2), Pixel(3,3), Pixel(3,4), Pixel(4,2), Pixel(4,3), Pixel(4,4)}, only the set {Pixel(2,4), Pixel(3,4), Pixel(4,2), Pixel(4,3), Pixel(4,4)} needs to be retrieved from the slower main memory, because the set {Pixel(2,2), Pixel(2,3), Pixel(3,2), Pixel(3,3)} is already in the cache for faster retrieval. This invention tries to exploit this condition. However, to detail the process, we need to explain the caching process of CPUs a bit more.

For a contemporary computer, memory access is guided by a hierarchical structure. This is done to increase the speed of access for frequently used data while still storing large amounts of data. For modern computers, the memory with the slowest access yet largest capacity is the main memory. The one with the fastest access and the smallest capacity is the L1 cache (which is usually placed on the CPU chip). Depending on the computer, there are other levels of caches in the hierarchy (L2 and L3). If a specific level of the hierarchy (for example L1) is full and a new piece of data needs to be retrieved, an entry from that level (again, for example, L1) must be removed and replaced with the data needed. Modern computers use a least recently used strategy, where each level of the hierarchy keeps track of the last time a piece of data was used. When an exchange is required, the entry with the longest time since its last use is evicted. This patent document details a method that orders the execution of pixels so that data retrieved from the slower levels of the hierarchy is used more times before being evicted, reducing accesses to the slower levels.
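To make the eviction behaviour concrete, the following toy model (an assumption of this write-up, not part of the patent) treats one cache level as a small, fully associative, least-recently-used store of pixel addresses and merely counts how often the slower memory level has to be consulted:

```c
#include <string.h>

#define CACHE_CAPACITY 30   /* capacity in pixels, matching the example below */

/* Toy fully associative LRU cache over pixel addresses, used only to count
 * how many accesses must go to the slower memory level. */
typedef struct {
    int  addr[CACHE_CAPACITY];   /* cached pixel addresses (-1 = empty)   */
    long stamp[CACHE_CAPACITY];  /* logical time of each entry's last use */
    long clock;                  /* logical time                          */
    long misses;                 /* accesses served by slower memory      */
} LruCache;

static void lru_init(LruCache *c)
{
    memset(c, 0, sizeof *c);
    for (int i = 0; i < CACHE_CAPACITY; i++) c->addr[i] = -1;
}

static void lru_access(LruCache *c, int addr)
{
    c->clock++;
    int victim = 0;
    for (int i = 0; i < CACHE_CAPACITY; i++) {
        if (c->addr[i] == addr) { c->stamp[i] = c->clock; return; }  /* hit */
        if (c->stamp[i] < c->stamp[victim]) victim = i;   /* track LRU slot */
    }
    c->misses++;                           /* miss: fetch from slower level */
    c->addr[victim] = addr;
    c->stamp[victim] = c->clock;
}
```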

Table 3 shows a sample input image over which a simple 3x3 filter is executed. Underlined and bold pixels show the points at which the filter will be run (these are the ones that have a full 3x3 neighborhood, so boundary pixels are omitted). Assume that we have a single level of cache on top of the main memory and that its size is 30 pixels. In this case, calculating the output for the set of pixels S1 = {(2,2), (2,3), (2,4), (3,2), (3,3), (3,4), (4,2), (4,3), (4,4), (5,2), (5,3), (5,4)} instead of S2 = {(2,2), (2,3), (2,4), (2,5), (2,6), (2,7), (2,8), (2,9)} is more efficient. Visualizing the execution is the best way to understand this fact intuitively; this visualization is shown in Table 4.

Table 3: An Example Image and Specifically Marked Processed Pixels (Ri Corresponds to Row i and Cj Corresponds to Column j)

Table 4: Two Examples of Execution Sets with Specifically Marked Processed Pixels

Assume that we have issued a command to execute set S1. This requires the retrieval of a total of 30 data elements (let us say this is equal to our cache size). The 30 retrieved elements are enough to calculate the outputs for a total of 12 pixels. However, in the case of executing S2, the same amount of data is retrieved from the lower memory hierarchies, and this results in the calculation of only 8 pixels' outputs. This is because there is a high co-usage rate of data between, for example, {(2,2), (2,3), (3,2), (3,3)} (see Table 1) for the S1 execution. However, for the S2 execution, there is very little co-usage of data (note that there is no common data between (2,2) and (2,9)). After visually analyzing Table 4, the final reasoning should be this: the order of execution between pixels should be such that the data retrieved into the cache resembles a square, so that the inner area of the square, over which the actual calculation is carried out, is maximized.
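The counting argument can be reproduced with a short sketch. The coordinates of S1 and S2 follow the example above; the array bounds and helper name are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdio.h>

#define H 16
#define W 16

/* Count how many distinct input pixels a 3x3 filter touches when
 * computing the outputs listed in 'outs' (row, col pairs). */
static int distinct_inputs(const int outs[][2], int n)
{
    bool touched[H][W] = { false };
    int count = 0;
    for (int i = 0; i < n; i++)
        for (int dr = -1; dr <= 1; dr++)
            for (int dc = -1; dc <= 1; dc++) {
                int r = outs[i][0] + dr, c = outs[i][1] + dc;
                if (!touched[r][c]) { touched[r][c] = true; count++; }
            }
    return count;
}

int main(void)
{
    /* S1: a 4x3 block of output pixels; S2: a single row of 8 outputs. */
    const int s1[][2] = { {2,2},{2,3},{2,4},{3,2},{3,3},{3,4},
                          {4,2},{4,3},{4,4},{5,2},{5,3},{5,4} };
    const int s2[][2] = { {2,2},{2,3},{2,4},{2,5},{2,6},{2,7},{2,8},{2,9} };
    printf("S1: 12 outputs from %d distinct inputs\n", distinct_inputs(s1, 12));
    printf("S2:  8 outputs from %d distinct inputs\n", distinct_inputs(s2, 8));
    return 0;   /* both sets touch 30 distinct input pixels */
}
```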

Two flowcharts are presented to explain the proposed method (see Figures 4-5). Figure 4 shows the standard order of processing the pixels. Figure 4 is for informative purposes only; therefore, we do not make any claim regarding the method presented in Figure 4. The order of execution carried out for the method shown in Figure 4 is visualized in Figure 2. This is the standard method because, for a CPU-like processing unit, the software component implementing the method shown in Figure 4 is essentially written as two nested for loops. The simplicity and commonality of such an approach make the method shown in Figure 4 the main and standard method for calculating convolutions over an image. In the method shown in Figure 4, the order of execution is such that a single row of the image is processed fully before the processing of the next row begins.
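A minimal sketch of this standard row-ordered loop structure could look as follows. Row-major storage and a 3x3 filter are assumed, boundary pixels are skipped, and F_pixel is an illustrative per-pixel function standing in for F, not the patent's own implementation.

```c
/* Illustrative 3x3 filter applied at (r, c); the weights are placeholders. */
static float F_pixel(const float *in, int W, int r, int c)
{
    static const float w[3][3] = { {1,1,1}, {1,1,1}, {1,1,1} };
    float sum = 0.0f;                          /* bias omitted for brevity */
    for (int dr = -1; dr <= 1; dr++)
        for (int dc = -1; dc <= 1; dc++)
            sum += w[dr + 1][dc + 1] * in[(r + dr) * W + (c + dc)];
    return sum;
}

/* Standard row-ordered order of execution (Figure 4): each output row is
 * processed fully before the next row is started ('in'/'out' are row-major). */
void convolve_row_ordered(const float *in, float *out, int H, int W)
{
    for (int r = 1; r < H - 1; r++)        /* skip boundary rows    */
        for (int c = 1; c < W - 1; c++)    /* skip boundary columns */
            out[r * W + c] = F_pixel(in, W, r, c);
}
```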

Programming languages can be classified into two categories with respect to the storage of arrays in the actual memory: row ordered and column ordered. Row ordered languages place one row of the image after another in the main memory (i.e. there is higher spatial locality within a row of the image) while column ordered languages place one column of the image after another in the main memory (i.e. there is higher spatial locality within a column of the image). Figure 2 and Figure 4 are presented in this way because the majority of programming languages are row ordered, so this mode of implementation is more common (to exploit spatial locality within rows). The presentation is not limited to the row ordered examples/figures shown in Figure 2 and Figure 4; it is trivial to extend the presentation shown in Figure 2 and Figure 4 to the configuration shown in Figure 6. Without loss of generality, for the remainder of the document the standard mode of processing presented in Figure 4 should be understood to work on both row ordered and column ordered configurations.
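In terms of addressing, the two conventions differ only in how a pixel at (row r, column c) maps to a linear memory offset; a minimal sketch (function names are illustrative):

```c
/* Row ordered (row-major): pixels of the same row are adjacent in memory. */
static inline int idx_row_ordered(int r, int c, int W) { return r * W + c; }

/* Column ordered (column-major): pixels of the same column are adjacent. */
static inline int idx_col_ordered(int r, int c, int H) { return c * H + r; }
```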

Figure 5 shows the proposed order of processing the pixels. The order of execution carried out for the method proposed in Figure 5 is visualized in Figure 3. Note that the data retrieved into the cache does resemble a square, so a performance improvement can be achieved with the proposed method. This can be seen visually by inspecting the highlighted pixels in Figure 3 that are retrieved consecutively. At any time during the execution, the image content present in the cache resembles a square. After processing all the pixels that are calculable with the data within the cache, the order presented for this method requires evicting the left side of the square. In place of the evicted pixels, a column on the right side of the square is retrieved from the lower level memory hierarchies. Thus, the proposed method can be thought of as retrieving the pixels of the image in a sliding square formation.

The size of the square image present in the cache (at any given time) should be calculated prior to execution, and the method should be adjusted accordingly (this determines the sizes of the groups) for the most effective execution. To do so, however, the cache level for which the cache usage optimization is carried out should be determined (i.e. do we try to optimize cache usage for the L1, L2 or L3 cache?). Let us say the size of the image is size(I). We are trying to find the lowest (i.e. largest) level of the cache hierarchy x such that size(Lx) << size(I) (i.e. the size of the image should be comfortably larger than the cache size to effectively improve the speed of execution; however, if the image size is comfortably larger than both size(L1) and size(L2), we should be optimizing for the larger L2 cache). This is generally expected to be the best performing level. The level of the cache hierarchy at which the optimization takes place can also be determined experimentally; the target cache level giving the best performance improvement can then be used for the final deployment. After determining the target cache level, the size of the square in one dimension can be calculated. Let us say the size of the square is s. Then s can be calculated as follows (size(Lx) is denoted in pixels):

s = round( sqrt(size(Lx)) )
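A minimal sketch of this group-size calculation, with the cache size given in bytes and converted to pixels, might be as follows; the 4-byte-per-pixel assumption and the function name are illustrative.

```c
#include <math.h>
#include <stddef.h>

/* s = round( sqrt(size(Lx)) ), with size(Lx) expressed in pixels.
 * cache_bytes is the size of the cache level chosen for the optimization;
 * bytes_per_pixel would typically be 4 for 32-bit pixel data. */
static int group_size(size_t cache_bytes, size_t bytes_per_pixel)
{
    double pixels = (double)cache_bytes / (double)bytes_per_pixel;
    return (int)lround(sqrt(pixels));
}
```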

As mentioned earlier, programming languages can be classified into two categories with respect to the storage of arrays in the actual memory: row ordered and column ordered. Row ordered languages place one row of the image after another in the main memory (i.e. there is higher spatial locality within a row of the image) while column ordered languages place one column of the image after another in the main memory (i.e. there is higher spatial locality within a column of the image). Figure 3 and Figure 5 are presented in this way because the majority of programming languages are row ordered, so this mode of implementation is more common (to exploit spatial locality within rows). The presentation is not limited to the row ordered examples/figures shown in Figure 3 and Figure 5; it is trivial to extend the presentation shown in Figure 3 and Figure 5 to the configuration shown in Figure 7. Without loss of generality, for the remainder of the document, the proposed mode of processing presented in Figure 5 should be understood to work on both row ordered and column ordered configurations.

A method to accelerate deep learning applications for embedded applications effectively implements what is visually presented in Figure 3 and Figure 5.

A method to accelerate deep learning applications for embedded applications comprises the following steps:

- Divide rows into consecutive row groups

- Initialize for the first group of calculation

- Check if all groups are processed

- Initialize for the first column of calculation within the group

- Check if all columns within the group are processed

- Proceed to the next group

- Initialize for the first row of calculation within the group and the current column

- Check if all rows within the group and the current column are processed

- Proceed to the next column

- Calculate for the pixel belonging to the selected group, column and row

- Proceed to the next row

In the step "divide rows into consecutive row groups", the size of the row groups into which the image is partitioned is found. As mentioned earlier, this is mainly done in two steps. The first step is finding the cache level for which the optimization is done. This is done by finding the largest cache level that has a much smaller size than the image. While this definition is vague in nature, it usually means the cache can hold only a few lines of the image. Before starting the processing, this first step can be calculated using the recorded/retrieved cache sizes of the system and the size of the image to be processed. Of course, this selection can also be user controlled, and the deployment configuration can be determined via experimentation. The second step is to find the size of the best fitting square that the selected cache level can store. As previously discussed, this size s can be calculated as follows: s = round( sqrt(size(Lx)) ).

For example: the image has a size of 1024 x 1024 where each pixel stores 4 bytes of data; the cache sizes are determined as follows: size(L1) = 1024 bytes, size(L2) = 32 kbytes, size(L3) = 1 Mbyte. In the first step, we determine that each line of the image is 4096 bytes, so L1 can store 0.25 lines, L2 can store 8 lines and L3 can store 256 lines (almost a quarter of the image). We conclude that L2 is much smaller than the image and is the largest cache level for which size(L2) << size(image). Thus, we optimize with respect to L2. In the second step we calculate the size of the best fitting square using the formula s = round( sqrt(4096/4) ) = 32 (4096/4 = 1024 means the cache can only hold 1024 pixels). This s = 32 simply means that, for best performance, the image needs to be partitioned into groups 32 lines tall and processed that way (it is important to note that if an n x n filter is utilized, processing groups 32 lines tall actually corresponds to calculating 32 - (n-1)/2 lines of output; this is because, for an n x n filter, the (n-1)/2 lines at the top and the bottom of the strip do not have an n x n neighborhood, so we cannot calculate outputs for them).
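A sketch of the first step described above, choosing the target cache level as the largest level that is still much smaller than the image, might look as follows; the margin factor deciding "much smaller" is an assumption here, and, as noted, the level can equally be chosen experimentally or by the user.

```c
#include <stddef.h>

/* Return the index of the largest cache level (0 = L1, 1 = L2, ...) whose
 * size is still comfortably smaller than the image; the factor of 16 used
 * to decide "comfortably smaller" is purely illustrative. */
static int pick_cache_level(const size_t cache_bytes[], int n_levels,
                            size_t image_bytes)
{
    int best = 0;                      /* fall back to L1 if nothing fits */
    for (int i = 0; i < n_levels; i++)
        if (cache_bytes[i] * 16 <= image_bytes)
            best = i;
    return best;
}
```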

One small improvement, if application specific data is available, is to find out the amount of cache used by the application for data that is not related to the convolution. If a portion of the cache is occupied by some other application specific data, we can subtract its size and find the cache size available for the convolution. To demonstrate with an example: if size(L2) = 4096 bytes and we know that 1024 bytes of this area are used for application specific data (one example of application specific data could be this: some operating systems allow some user specified data to be permanently available in the cache, so that this data is never evicted), we can conclude that only 3072 bytes are available. Then we can conclude that the best group size is s = round( sqrt(3072/4) ) = 28.
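Following the same sketch as before, the adjustment only changes the amount of cache handed to the formula; the function name and the 4-byte pixel size remain illustrative.

```c
#include <math.h>
#include <stddef.h>

/* s = round( sqrt( (size(Lx) - size(occupied)) in pixels ) ).
 * Example from the text: (4096 - 1024) / 4 = 768 pixels -> s = 28. */
static int group_size_adjusted(size_t cache_bytes, size_t occupied_bytes,
                               size_t bytes_per_pixel)
{
    double pixels = (double)(cache_bytes - occupied_bytes)
                    / (double)bytes_per_pixel;
    return (int)lround(sqrt(pixels));
}
```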

In the step "initialize for the first group of calculation", simple variable initializations take place. These initializations may include the initialization of a loop variable (to hold the next group index to be processed), a stopping condition variable, etc. In a similar manner, in the step "initialize for the first column of calculation within the group", initializations of a loop variable (to hold the next column index to be processed within the group) and of a stopping condition variable, etc., may take place. Likewise, in the step "initialize for the first row of calculation within the group and the current column", initializations of a loop variable (to hold the next row index to be processed within the group and column) and of a stopping condition variable, etc., may take place.

After all groups have been checked and it has been ensured that they have been processed, we simply finish the execution and terminate. Otherwise, we continue to process the next group of lines. While processing a group, we execute the step checking whether all columns within the group are processed. If this check returns true, it means we have finished processing the group and need to continue with the processing of the next group, if available; thus, we go to the step "Check if all groups are processed". If this check returns false, not all columns within the group are processed, so we need to continue processing the next column within the group.

While processing a specific group and a specific column, we execute a check of whether all rows within the group and the current column are processed. If this check returns true, we have finished processing all the rows within the column and we continue with the step "Check if all columns within the group are processed". Otherwise, we have a specific group index, column index and row index (the row index is counted within the group). Thus, we can find the specific pixel that these indices correspond to and calculate the output corresponding to this pixel. For example, if the group index is 17, the column index is 26, the row index is 8 and s (which effectively corresponds to the size of every group) is 10, we can conclude that we are processing pixel I(17*10+8, 26), i.e. I(178, 26). After finding the specific pixel position, we calculate the output of the pixel in the step "Calculate for the pixel belonging to the selected group, column and row" using the pixel along with its neighboring pixels.

Although this step-by-step explanation of the execution order is somewhat abstract, what is effectively implemented in a method to accelerate deep learning applications for embedded applications is visually presented in Figure 3.
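Putting the steps together, a minimal sketch of the proposed order of execution might be as follows. The row ordered configuration, the 3x3 filter and the row-major images are assumptions of the sketch; F_pixel is the same illustrative per-pixel function used in the earlier sketch, and s is the group size obtained from the formula above.

```c
/* Proposed order of execution (Figure 5): rows are divided into consecutive
 * groups of height s; within a group, columns are visited left to right and,
 * for each column, the rows of the group are visited top to bottom, so the
 * data held in the cache stays close to an s x s square sliding over the image. */
void convolve_grouped(const float *in, float *out, int H, int W, int s)
{
    for (int g = 0; g * s < H; g++) {               /* select group           */
        int r_begin = g * s;
        int r_end   = (g + 1) * s < H ? (g + 1) * s : H;
        for (int c = 1; c < W - 1; c++) {           /* column within group    */
            for (int r = r_begin; r < r_end; r++) { /* row within the column  */
                if (r < 1 || r > H - 2)             /* skip boundary rows     */
                    continue;
                out[r * W + c] = F_pixel(in, W, r, c);
            }
        }
    }
}
```

Compared with the standard loop, only the loop nesting changes; the per-pixel work is identical, which is why any speed-up comes purely from the cache behaviour.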

The method can effectively improve the execution time of finding convolutions within an image, as used by state-of-the-art convolutional neural network methods working on a CPU-based system, via efficient usage of the cache hierarchy in the system.

Within the scope of these basic concepts, it is possible to develop a wide variety of embodiments of the inventive "method to accelerate deep learning applications for embedded applications". The invention cannot be limited to the examples described herein; it is essentially as defined in the claims.




 