


Title:
PICTURE FRAME PROCESSING USING MACHINE LEARNING
Document Type and Number:
WIPO Patent Application WO/2023/132840
Kind Code:
A1
Abstract:
Embodiments of apparatus and methods for processing picture frames using machine learning are disclosed. In one example, an apparatus for processing picture frames using machine learning includes a system-on-a-chip (SoC). The SoC includes a processor configured to process a first picture frame using a first convolutional neural network layer of a machine learning model, and process a second picture frame using a second convolutional neural network layer or a pooling layer of the machine learning model. The second picture frame occurs before the first picture frame. The second convolutional neural network layer is processed after the first convolutional neural network layer in the machine learning model.

Inventors:
HUANG HSILIN (US)
WANG NIEN-TSU (US)
FAN YI (US)
Application Number:
PCT/US2022/011855
Publication Date:
July 13, 2023
Filing Date:
January 10, 2022
Assignee:
ZEKU INC (US)
International Classes:
G06K9/52; G06K9/46
Foreign References:
US20200110966A12020-04-09
US20200167943A12020-05-28
US20170116718A12017-04-27
US20190327486A12019-10-24
Attorney, Agent or Firm:
ZOU, Zhiwei (US)
Claims:
WHAT IS CLAIMED IS:

1. An apparatus for processing picture frames using machine learning, comprising: a system-on-a-chip (SoC) comprising: a processor configured to: process a first picture frame using a first convolutional neural network layer of a machine learning model; and process a second picture frame using a second convolutional neural network layer or a pooling layer of the machine learning model, wherein the second picture frame occurs before the first picture frame; and the second convolutional neural network layer is processed after the first convolutional neural network layer in the machine learning model.

2. The apparatus of claim 1, wherein the processor is configured to: process the second picture frame using the second convolutional neural network layer; and process a third picture frame using the pooling layer, wherein the pooling layer is processed after the second convolutional neural network layer in the machine learning model; and the third picture frame occurs before the second picture frame.

3. The apparatus of claim 2, wherein the second picture frame occurs immediately before the first picture frame; and the third picture frame occurs immediately before the second picture frame.

4. The apparatus of claim 2, wherein the pooling layer is a global pooling layer configured to process global information of the first, second, and third picture frames.

5. The apparatus of claim 2, wherein the processor is further configured to: process the first picture frame using a first skip connection layer of the machine learning model; process the second picture frame using a second skip connection layer of the machine learning model; concatenate an output of the second skip connection layer and an output of the pooling layer; and concatenate an output of the first skip connection layer and an output of a corresponding up-convolutional neural network layer of the machine learning model.

6. The apparatus of claim 1, wherein the machine learning model comprises a U-Net; and the first convolutional neural network layer, the second convolutional neural network layer, and the pooling layer are in an encoder side of the U-Net.

7. The apparatus of claim 1, wherein to process the first picture frame, the processor is further configured to process a chunk of the first picture frame using the first convolutional neural network layer; and the SoC further comprises an internal memory configured to store an output of the first convolutional neural network layer based on the chunk of the first picture frame.

8. The apparatus of claim 7, wherein the chunk of the first picture frame comprises a number of scan lines of the first picture frame; and the number of the scan lines is determined based on a size of a kernel used by the first convolutional neural network layer.

9. The apparatus of claim 7, wherein the internal memory is further configured to store the second picture frame; and the processor is further configured to retrieve the second picture frame from the internal memory.

10. The apparatus of claim 7, further comprising an external memory configured to store the second picture frame, wherein the processor is further configured to retrieve the second picture frame from the external memory.

11. A method for processing picture frames using machine learning, comprising: processing, by a processor of a system-on-a-chip (SoC), a first picture frame using a first convolutional neural network layer of a machine learning model; and processing, by the processor, a second picture frame using a second convolutional neural network layer or a pooling layer of the machine learning model, wherein the second picture frame occurs before the first picture frame; and the second convolutional neural network layer is processed after the first convolutional neural network layer in the machine learning model.

12. The method of claim 11, further comprising: processing, by the processor, the second picture frame using the second convolutional neural network layer; and processing, by the processor, a third picture frame using the pooling layer, wherein the pooling layer is processed after the second convolutional neural network layer in the machine learning model; and the third picture frame occurs before the second picture frame.

13. The method of claim 12, wherein the second picture frame occurs immediately before the first picture frame; and the third picture frame occurs immediately before the second picture frame.

14. The method of claim 12, wherein the pooling layer is a global pooling layer configured to process global information of the first, second, and third picture frames.

15. The method of claim 12, further comprising: processing, by the processor, the first picture frame using a first skip connection layer of the machine learning model; processing, by the processor, the second picture frame using a second skip connection layer of the machine learning model; concatenating, by the processor, an output of the second skip connection layer and an output of the pooling layer; and concatenating, by the processor, an output of the first skip connection layer and an output of a corresponding up-convolutional neural network layer of the machine learning model.

16. The method of claim 11, wherein the machine learning model comprises a U-Net; and the first convolutional neural network layer, the second convolutional neural network layer, and the pooling layer are in an encoder side of the U-Net.

17. The method of claim 11, wherein processing the first picture frame comprises processing a chunk of the first picture frame using the first convolutional neural network layer; and the method further comprises storing, by an internal memory of the SoC, an output of the first convolutional neural network layer based on the chunk of the first picture frame.

18. The method of claim 17, wherein the chunk of the first picture frame comprises a number of scan lines of the first picture frame; and the number of the scan lines is determined based on a size of a kernel used by the first convolutional neural network layer.

19. A non-transitory computer-readable medium encoded with instructions that, when executed by a processor, cause the processor to: process a first picture frame using a first convolutional neural network layer of a machine learning model; and process a second picture frame using a second convolutional neural network layer and a pooling layer of the machine learning model, wherein the second picture frame occurs before the first picture frame; the second convolutional neural network layer is processed after the first convolutional neural network layer in the machine learning model; and the pooling layer is processed after the second convolutional neural network layer in the machine learning model.

20. The non-transitory computer-readable medium of claim 19, wherein the instructions, when executed by the processor, further cause the processor to: process the second picture frame using the second convolutional neural network layer; and process a third picture frame using the pooling layer, wherein the third picture frame occurs before the second picture frame.

Description:
PICTURE FRAME PROCESSING USING MACHINE LEARNING

BACKGROUND

[0001] Embodiments of the present disclosure relate to machine learning.

[0002] Machine learning is a branch of artificial intelligence (AI) and computer science that focuses on the use of data and algorithms to imitate the way humans learn, gradually improving accuracy. Machine learning enables analysis of massive quantities of data and has become a prominent topic in many different fields. Machine learning may be applied, for example, to image enhancement, video editing, object detection and recognition, audio noise reduction, speech recognition, natural language processing (NLP), and the like.

SUMMARY

[0003] In one example, an apparatus for processing picture frames using machine learning includes a system-on-a-chip (SoC). The SoC includes a processor configured to process a first picture frame using a first convolutional neural network layer of a machine learning model, and process a second picture frame using a second convolutional neural network layer or a pooling layer of the machine learning model. The second picture frame occurs before the first picture frame. The second convolutional neural network layer is processed after the first convolutional neural network layer in the machine learning model.

[0004] In another example, a method for processing picture frames using machine learning is disclosed. A first picture frame is processed by a processor of an SoC using a first convolutional neural network layer of a machine learning model. A second picture frame is processed by the processor using a second convolutional neural network layer or a pooling layer of the machine learning model. The second picture frame occurs before the first picture frame. The second convolutional neural network layer is processed after the first convolutional neural network layer in the machine learning model.

[0005] In still another example, a non-transitory computer-readable medium is encoded with instructions that, when executed by a processor, cause the processor to process a first picture frame using a first convolutional neural network layer of a machine learning model, and process a second picture frame using a second convolutional neural network layer and a pooling layer of the machine learning model. The second picture frame occurs before the first picture frame. The second convolutional neural network layer is processed after the first convolutional neural network layer in the machine learning model. The pooling layer is processed after the second convolutional neural network layer in the machine learning model.

BRIEF DESCRIPTION OF THE DRAWINGS

[0006] The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate embodiments of the present disclosure and, together with the description, further serve to explain the principles of the present disclosure and to enable a person skilled in the pertinent art to make and use the present disclosure.

[0007] FIG. 1 illustrates a scheme of processing picture frames using a U-Net machine learning model, according to some embodiments of the present disclosure.

[0008] FIGs. 2A-2E illustrate various data dependency graphs of a machine learning model, according to some embodiments of the present disclosure.

[0009] FIG. 3 illustrates various data dependency graphs of a U-Net machine learning model, according to some embodiments of the present disclosure.

[0010] FIG. 4 illustrates a scheme of processing picture frames using a U-Net machine learning model, according to some embodiments of the present disclosure.

[0011] FIG. 5 illustrates another scheme of processing picture frames using a U-Net machine learning model, according to some embodiments of the present disclosure.

[0012] FIG. 6 illustrates a system implementing the schemes of processing picture frames using machine learning, according to some embodiments of the present disclosure.

[0013] FIG. 7 illustrates a flow chart of a method for processing picture frames using machine learning, according to some embodiments of the present disclosure.

[0014] FIG. 8 illustrates a flow chart of a detailed method for processing picture frames using machine learning, according to some embodiments of the present disclosure.

[0015] Embodiments of the present disclosure will be described with reference to the accompanying drawings.

DETAILED DESCRIPTION

[0016] Although specific configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. A person skilled in the pertinent art will recognize that other configurations and arrangements can be used without departing from the spirit and scope of the present disclosure. It will be apparent to a person skilled in the pertinent art that the present disclosure can also be employed in a variety of other applications.

[0017] It is noted that references in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” “some embodiments,” “certain embodiments,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of a person skilled in the pertinent art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

[0018] In general, terminology may be understood at least in part from usage in context. For example, the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.

[0019] Various aspects of picture frame processing systems will now be described with reference to various apparatus and methods. These apparatus and methods will be described in the following detailed description and illustrated in the accompanying drawings by various blocks, modules, units, components, circuits, steps, operations, processes, algorithms, etc. (collectively referred to as “elements”). These elements may be implemented using electronic hardware, firmware, computer software, or any combination thereof. Whether such elements are implemented as hardware, firmware, or software depends upon the particular application and design constraints imposed on the overall system.

[0020] Machine learning often needs a huge amount of computing, data bandwidth, and storage. For an internet of things (IoT) device or a handheld system, there may be limited resources dedicated to computing, bandwidth, and storage. Moreover, power consumption may also need to be limited in a mobile, battery-powered device. Machine learning models need more computing and memory access, which may cause more power consumption, further limit the battery life, and aggravate heat problems on cellular phones or IoT devices. Chip-to-chip data transfer may consume more power than internal data transfer. For example, data may be transferred between a processor of an SoC and an external memory chip, such as a dynamic random-access memory (DRAM) chip. Computing SoC logic usually uses a different manufacturing process from DRAM manufacturing, so the two are usually implemented in different chips and packages. Although different dies can be packaged together, the input and output (I/O) pins and physical layer may still consume more power. A larger internal memory (a.k.a., on-chip memory) of the SoC, such as static random-access memory (SRAM), magnetoresistive random-access memory (MRAM), or resistive random-access memory (RRAM), can be used to hold more data inside the SoC and to avoid accessing too much data from DRAM. This approach may also reduce power consumption. However, the larger SRAM and die size may cause the per-chip cost to be too high.

[0021] In a given machine learning model, the architecture may have an input layer and an output layer, as well as a few hidden layers. To reduce power consumption, the data of hidden layers may be kept inside the chip in SRAM. This may reduce or avoid the need to store and fetch the internal layers through the DRAM interface. The power consumption of a memory access on DRAM is much larger than an access on SRAM, usually by more than 10 times. Thus, if all the feature maps and weights of the model could fit inside internal memory (e.g., SRAM), a lot of power could be saved compared to schemes that need to load from and store to external memory (e.g., DRAM).

[0022] However, for video processing, which involves a large number of picture frames, it is challenging to keep all the feature maps and weights inside internal memory, especially when the feature maps are very big, for example, for pictures with 4K dimensions. The input layer data will be 4k x 4k (e.g., Bayer format red, green, green, blue (RGGB), 4 channels) x 12 bits/pixel = 24 MB. Not only that, but the output of this layer also needs to be buffered. If there is a skip branch (e.g., a skip connection layer), the memory size will be increasingly large. Thus, it may be valuable to keep the SRAM small while keeping as much of the access as possible on the SRAM. One approach to process the model is to process the first layer of the model, then process the next layer, and keep going until the end of the model. However, the storage requirement for a layer might exceed the capacity of the SRAM.
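
As a rough illustration of the arithmetic above (a sketch only, assuming a 4096 x 4096 Bayer RGGB frame at 12 bits per pixel, matching the 4k x 4k, 12 bits/pixel figures in the preceding paragraph), the input-layer buffer size can be estimated as follows:

```python
# Back-of-the-envelope buffer size for the input layer described above,
# assuming a 4096 x 4096 Bayer (RGGB) frame at 12 bits per pixel.

def frame_buffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Return the raw buffer size in bytes for one picture frame."""
    return width * height * bits_per_pixel // 8

if __name__ == "__main__":
    size = frame_buffer_bytes(4096, 4096, 12)
    print(f"{size / (1024 ** 2):.0f} MiB")  # ~24 MiB, matching the estimate above
```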

[0023] For example, FIG. 1 illustrates a scheme of processing picture frames using a U-Net machine learning model (referred to herein as "U-Net" 100), according to some embodiments of the present disclosure. U-Net 100 is based on a fully convolutional neural network and includes an encoder side 102 (a.k.a., a contracting path) and a decoder side 104 (a.k.a., an expansive path), which give it the U-shaped architecture. Encoder side 102 may be a typical convolutional neural network that includes multiple convolutional neural network layers (e.g., Conv1, Conv2, Conv3, and Conv4 in FIG. 1) and one or more pooling layers (e.g., Global Pooling in FIG. 1, a.k.a., a max pooling layer). Decoder side 104 may include multiple up-convolutional neural network layers (e.g., upConv1, upConv2, upConv3, and upConv4 in FIG. 1). U-Net 100 also includes multiple skip connection layers (e.g., Skip1_Conv1, Skip2_Conv1, and Skip3_Conv1 in FIG. 1), each connecting encoder side 102 and decoder side 104. Decoder side 104 of U-Net 100 thus may further include concatenation nodes (e.g., Concat1, Concat2, and Concat3 in FIG. 1) and convolutional neural network layers (e.g., Conv5 and Conv6 in FIG. 1), each concatenating the output of a respective skip connection layer and the output of a respective up-convolutional neural network layer.
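
The following PyTorch sketch is purely illustrative of this kind of U-Net-style topology (an encoder, a global pooling bottleneck, skip connection layers, up-convolutions, and concatenations); the class name, channel counts, strides, and layer count are hypothetical and do not reproduce U-Net 100 of FIG. 1.

```python
# Illustrative sketch only: a toy U-Net-style network with an encoder,
# a global pooling bottleneck, skip connections, and an up-convolutional
# decoder. Channel counts and layer names are hypothetical, not from FIG. 1.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch: int = 4, base_ch: int = 16):
        super().__init__()
        # Encoder side (contracting path)
        self.conv1 = nn.Conv2d(in_ch, base_ch, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(base_ch, base_ch * 2, 3, stride=2, padding=1)
        # Skip connection layers feeding the decoder
        self.skip1 = nn.Conv2d(base_ch, base_ch, 1)
        self.skip2 = nn.Conv2d(base_ch * 2, base_ch * 2, 1)
        # Global pooling bottleneck (global information such as average luminance)
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        # Decoder side (expansive path)
        self.up1 = nn.ConvTranspose2d(base_ch * 2 * 2, base_ch, 2, stride=2)
        self.up2 = nn.ConvTranspose2d(base_ch * 2, in_ch, 2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = torch.relu(self.conv1(x))          # 1/2 resolution
        e2 = torch.relu(self.conv2(e1))         # 1/4 resolution
        g = self.global_pool(e2).expand_as(e2)  # broadcast global context
        d1 = self.up1(torch.cat([self.skip2(e2), g], dim=1))   # back to 1/2
        d2 = self.up2(torch.cat([self.skip1(e1), d1], dim=1))  # back to full size
        return d2

if __name__ == "__main__":
    frame = torch.randn(1, 4, 64, 64)   # e.g., a small RGGB-like tensor
    print(TinyUNet()(frame).shape)      # torch.Size([1, 4, 64, 64])
```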

[0024] U-Net 100 can be applied to video/picture frame processing, such as image segmentation or image reconstruction. For example, in encoder side 102 during the contraction, the spatial information of a picture frame is reduced while feature information is increased. On the other hand, in decoder side 104 during the expansion, the feature and spatial information are combined through a sequence of up-convolutions and concatenations with high-resolution features from encoder side 102.

[0025] In this example, assume a picture frame having a size of 2000x1504x4x12 bits is inputted into the input layer of encoder side 102 of U-Net 100, processed by all hidden layers of U-Net 100 (e.g., each convolutional neural network layer and pooling layer (e.g., global pooling layer) in encoder side 102 and each skip connection layer) using a layer-by-layer method, and outputted from the output layer with the same size of 2000x1504x4x12 bits. Since the layer-by-layer method is used when processing the picture frame by each hidden layer in encoder side 102, at least the outputs of the Conv1 layer and the Skip1_Conv1 layer need to be held in memory during the process. The outputs of Skip2_Conv1 and Skip3_Conv1 also need to be held. Thus, a minimum memory size may be (outputs of Skip1_Conv1 + Skip2_Conv1 + Skip3_Conv1) + max (input_i data + output_i data). If all of this data is kept in internal memory (e.g., on-chip SRAM), it would be very costly.
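
The bookkeeping implied by this minimum-memory expression can be sketched as below; the function name and the byte counts are placeholders for illustration, not the sizes of FIG. 1.

```python
# Sketch of the minimum-memory expression above:
#   held skip outputs + max over layers of (input_i + output_i).
# The byte counts below are placeholders, not the sizes from FIG. 1.

def min_memory_bytes(skip_output_bytes: list[int],
                     layer_io_bytes: list[tuple[int, int]]) -> int:
    """skip_output_bytes: sizes of skip-connection outputs held for the decoder.
    layer_io_bytes: (input, output) sizes for each layer processed in turn."""
    held_skips = sum(skip_output_bytes)
    peak_layer = max(inp + out for inp, out in layer_io_bytes)
    return held_skips + peak_layer

if __name__ == "__main__":
    skips = [8 << 20, 4 << 20, 2 << 20]                    # Skip1/Skip2/Skip3 outputs (placeholders)
    layers = [(24 << 20, 12 << 20), (12 << 20, 6 << 20)]   # (input_i, output_i) per layer
    print(min_memory_bytes(skips, layers) / (1 << 20), "MiB")  # 50.0 MiB with these placeholders
```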

[0026] According to some embodiments, the picture frame is processed using U-Net 100 in a chunk-by-chunk manner to reduce the amount of data that needs to be held during the process. That is, each chunk of the picture frame (e.g., a number of scan lines of the picture frame), as opposed to the entire picture frame, may be processed. For example, as shown in FIG. 1, the size of memory to hold the data during the process may be (output of Skip1_Conv1 + Skip2_Conv1) + output of Conv4 + the line buffer indicated in each layer (8 of Tile layer + 8 of Concat1 + 8 of Skip3_Conv1 + 6 of Conv4 + 6 of Up_conv1 + 6 of Up_Conv2 + 6 of Concat2 + ....). As can be seen, one bottleneck is the global pooling layer, which may need the full size of the previous layer due to data dependency between layers of U-Net 100. That is, the chunk-by-chunk process cannot introduce significant improvement when a single picture frame is processed all the way through each layer of U-Net 100 as illustrated in the scheme of FIG. 1.

[0027] FIGs. 2A-2E illustrate various data dependency graphs of a machine learning model, according to some embodiments of the present disclosure. As shown in FIG. 2A, in one example, assuming a convolutional neural network layer in encoder side 102 of U-Net 100 uses a feature map B having 8 scan lines (L0 to L7) and a padding size of 1, as well as a kernel A (a matrix of weights) having a size of 3 (i.e., 3x3), the result C of the convolution operation with a stride of 1 between feature map B and kernel A (i.e., the output of the convolutional neural network layer) shows that each position in result C has a dependency of 3 scan lines (the size of kernel A) from feature map B. For example, to calculate each position in scan line R0 of result C, at least the data of 3 scan lines (padding, L0, and L1) is needed from feature map B. Due to the padding of feature map B, the size of result C is the same as the size of feature map B. The line number can be calculated as R(n) = L(n)+1.

[0028] As shown in FIG. 2B, if feature map B does not have any padding, the size of result C may become smaller (e.g., having fewer scan lines) compared with the example in FIG. 2A, but the data dependency remains the same (e.g., 3 scan lines (L0, L1, and L2) as determined based on the size of kernel A). The line number can be calculated as R(n) = L(n)+2. As shown in FIG. 2C, if the stride of the convolution operation performed by the convolutional neural network layer is greater than 1 (e.g., 2), fewer positions are calculated in result C compared with the example of FIG. 2A, but the data dependency remains the same (e.g., 3 scan lines (padding, L0, and L1) as determined based on the size of kernel A). The line number can be calculated as R(n) = L(2n)+1. Similarly, as shown in FIG. 2D, if the stride of the convolution operation performed by the convolutional neural network layer is greater than 1 (e.g., 2), fewer positions are calculated in result C compared with the example of FIG. 2B, but the data dependency remains the same (e.g., 3 scan lines (L0, L1, and L2) as determined based on the size of kernel A). The line number can be calculated as R(n) = L(2n)+2. For example, 1 result line R is from 3 feature map lines L, and 2 result lines R are from 3+2 feature map lines L. As shown in FIG. 2E, if a bilinear function (e.g., with a ratio of 0.5) is used in the convolution operation performed by the convolutional neural network layer, fewer positions are calculated in result C compared with the example of FIG. 2B, but the data dependency still exists. The line number can be calculated as R(n) = L(n/ratio)+1. For example, 1 result line R is from 1/0.5 feature map lines L, and 2 result lines R are from (1/0.5)x2 feature map lines L.
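
A small helper capturing these dependency rules is sketched below; the function name and defaults are illustrative, and the bilinear case of FIG. 2E is modeled as a 2-line window under that assumption.

```python
# Sketch of the line-dependency rules in FIGs. 2A-2E: the highest feature-map
# line L needed to produce result line R(n). Names and defaults are illustrative.

def highest_input_line(n: int, kernel: int = 3, stride: int = 1,
                       padding: int = 0, ratio: float = 1.0) -> int:
    """Index of the last input scan line required for output scan line n."""
    first = int(n / ratio) * stride - padding  # first line touched by the kernel window
    return first + kernel - 1                  # the window spans `kernel` consecutive lines

if __name__ == "__main__":
    print(highest_input_line(0, padding=1))             # 1 -> R(n) = L(n)+1      (FIG. 2A)
    print(highest_input_line(0, padding=0))             # 2 -> R(n) = L(n)+2      (FIG. 2B)
    print(highest_input_line(1, stride=2, padding=1))   # 3 -> R(n) = L(2n)+1     (FIG. 2C)
    print(highest_input_line(1, stride=2, padding=0))   # 4 -> R(n) = L(2n)+2     (FIG. 2D)
    print(highest_input_line(1, kernel=2, ratio=0.5))   # 3 -> R(n) = L(n/ratio)+1 (FIG. 2E)
```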

[0029] That is, regardless of the settings of the convolution operation of the convolutional neural network layer (e.g., padding, stride, ratio of the bilinear function, etc.), data dependency may exist between layers of U-Net 100, which requires a certain number of scan lines from the previous layer to be held (e.g., stored in memory) during the process. For example, FIG. 3 illustrates various data dependency graphs of a U-Net 300, according to some embodiments of the present disclosure. The encoder side of U-Net 300 includes two convolutional neural network layers Conv1 and Conv2 and a global pooling layer Tensor, and the decoder side of U-Net 300 includes two up-convolutional neural network layers UpConv1 and UpConv2 and a concatenation node Element Add. The setting of each layer includes, for example, the kernel size (e.g., 3), stride (e.g., 1 or 2), and padding (e.g., 0 or 1). FIG. 3 also illustrates the data dependency graphs corresponding to the various hidden layers Conv1, Conv2, Tensor, UpConv1, and UpConv2, respectively. As shown in FIG. 3, in order to get 2 scan lines in Element Add, 7 scan lines (L0 to L6) are needed from Conv1 due to data dependency. It is understood that U-Net 300 is illustrated as a simple machine learning model; as the network of the model goes deeper (e.g., with more layers), there will be more scan lines in dependency.

[0030] Referring back to FIG. 1, the data dependency between layers in encoder side 102 of U-Net 100 is illustrated as the number of scan lines that need to be kept during the process, such as 94 for Conv4, 188 for Conv3, 376 for Conv2, 752 for Conv1, and 1504 for the input layer. In other words, even if the picture frame is processed in a chunk-by-chunk manner by U-Net 100, the minimum size of each chunk at a layer is still limited by the data dependency (i.e., the number of scan lines to be kept).

[0031] Thus, to further reduce the memory requirement of machine learning, various schemes are described below that reduce the data dependency by using picture frames occurring at different times in machine learning, as opposed to a single picture frame. In some embodiments, since the global pooling layer of a U-Net is configured to process global information of different picture frames, such as lighting, average luminance, or global coloring of the environment, which change insignificantly between different picture frames, a previous picture frame can replace the current picture frame to be processed by the global pooling layer, thereby cutting the data dependency between the global pooling layer and the previous layers. Alternatively or additionally, in some embodiments, since picture frames change insignificantly in a video taken at a high scan rate (e.g., 60 Hz) most of the time, one or more previous picture frames can replace the current frame to be processed by one or more convolutional neural network layers in the encoder side, thereby cutting the data dependency between different convolutional neural network layers in the encoder side without sacrificing the quality of the model too much. Moreover, with the reduced data dependency, the picture frame can be processed in a chunk-by-chunk manner with a smaller chunk size compared with the scheme that processes a single picture frame (e.g., the current picture frame). As a result, the picture frame processing schemes disclosed herein can greatly improve the power usage, performance, and die area of SoCs.

[0032] FIG. 4 illustrates a scheme of processing picture frames using U-Net 100, according to some embodiments of the present disclosure. As shown in FIG. 4, a first picture frame t (the current picture frame) is received from the input layer and processed using each of the convolutional neural network layers Conv1, Conv2, Conv3, and Conv4 in encoder side 102 of U-Net 100. As opposed to processing the first picture frame t (e.g., the output of Conv4) using the global pooling layer as well, as shown in the scheme of FIG. 1, the first picture frame t at the global pooling layer (e.g., the output of Conv4) may be stored into a memory (e.g., an internal memory (e.g., SRAM) of an SoC implementing the scheme of FIG. 4 or an external memory (e.g., DRAM) outside of the SoC). Instead, a second picture frame t-1 (a previous picture frame occurring immediately before the current picture frame) is retrieved from the memory (e.g., the internal memory of the SoC or the external memory outside of the SoC) and is processed using the global pooling layer of U-Net 100. Since the global pooling layer of U-Net 100 is configured to process global information of the first and second picture frames t and t-1, such as lighting, average luminance, or global coloring of the environment, the second picture frame t-1 can replace the first picture frame t at the global pooling layer without sacrificing the quality of U-Net 100. On the other hand, by replacing the first picture frame t with the second picture frame t-1 at the global pooling layer, the data dependency between the global pooling layer and the previous convolutional neural network layer Conv4 is cut off. For example, as shown in FIG. 4, the data dependency between layers in encoder side 102 of U-Net 100 is illustrated as the number of scan lines that need to be kept during the process, such as 8 for Conv4, 18 for Conv3, 38 for Conv2, 78 for Conv1, and 158 for the input layer, which are significantly reduced compared with the scheme shown in FIG. 1.
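
A minimal sketch of this frame-delayed pooling idea is shown below; it is not the implementation of FIG. 4, and the `encoder`, `global_pool`, `decoder`, and `frame_store` objects are stand-ins assumed for illustration.

```python
# Sketch of the FIG. 4 idea: feed the global pooling branch with the stored
# encoder output of the previous frame (t-1) instead of the current frame (t),
# cutting the dependency between the pooling layer and Conv4 for frame t.
import torch

def process_frame(frame_t, encoder, global_pool, decoder, frame_store):
    feat_t = encoder(frame_t)                  # Conv1..Conv4 stand-in on the current frame t
    feat_prev = frame_store.get("conv4")       # encoder output stored for frame t-1
    frame_store["conv4"] = feat_t.detach()     # store the current output for the next frame
    if feat_prev is None:                      # very first frame: no previous frame yet
        feat_prev = feat_t
    ctx = global_pool(feat_prev)               # global info changes little between frames
    return decoder(feat_t, ctx)                # decoder side still sees the current frame

if __name__ == "__main__":
    store = {}
    encoder = torch.nn.Conv2d(4, 8, 3, stride=2, padding=1)   # stand-in for Conv1..Conv4
    pool = torch.nn.AdaptiveAvgPool2d(1)                      # stand-in for Global Pooling
    decoder = lambda feat, ctx: feat + ctx                    # stand-in for the decoder side
    for _ in range(3):                                        # three consecutive frames
        out = process_frame(torch.randn(1, 4, 32, 32), encoder, pool, decoder, store)
    print(out.shape)                                          # torch.Size([1, 8, 16, 16])
```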

[0033] As shown in FIG. 4, the first picture frame t is also processed using a first skip connection layer Skip1_Conv1 or Skip2_Conv1 of U-Net 100. The output of the first skip connection layer Skip1_Conv1 or Skip2_Conv1 and the output of a corresponding up-convolutional neural network layer UpConv4 or UpConv2 are concatenated using a corresponding concatenation node Concat3 or Concat2.

[0034] In some embodiments, the first picture frame is processed in a chunk-by-chunk manner using each of the convolutional neural network layers Conv1, Conv2, Conv3, and Conv4 in encoder side 102 of U-Net 100. That is, a partial output of a convolutional neural network layer may be produced, and the partial output of the convolutional neural network layer may be used as the input of the next convolutional neural network layer. For example, a chunk of the first picture frame may be processed using a convolutional neural network layer Conv1, Conv2, Conv3, or Conv4 in encoder side 102 of U-Net 100, and the output of the convolutional neural network layer may be stored in the internal memory (e.g., SRAM) of the SoC based on the chunk of the first picture frame. In some embodiments, the chunk of the first picture frame includes a number of scan lines, which may be determined based on the size of the kernel used by the convolutional neural network layer. As described above with respect to FIGs. 2A-2E, the number of scan lines may be further determined based on other settings of the convolutional neural network layer, such as padding, stride, and/or the ratio of the bilinear function. However, the chunk size (e.g., the number of scan lines) at each layer is also limited by the data dependency (e.g., the number of scan lines that need to be kept during the process). For example, at Conv1, although the data dependency is reduced from 752 scan lines in FIG. 1 to 78 scan lines in FIG. 4, such a data dependency may still be a bottleneck in reducing the required memory size in view of the setting of Conv1 (e.g., kernel size = 3).
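
Chunk-by-chunk processing of a single layer can be sketched as follows, assuming one convolutional layer with no padding along the scan-line dimension so that each chunk needs (kernel - 1) extra input lines, as in FIG. 2B; layer settings and chunk sizes are illustrative.

```python
# Sketch of chunk-by-chunk processing: run one convolutional layer over a few
# scan lines at a time, keeping only a small line buffer instead of the full
# frame. The overlap of (kernel - 1) lines reflects the dependency in FIG. 2B.
import torch
import torch.nn as nn

def conv_by_chunks(frame: torch.Tensor, conv: nn.Conv2d, chunk_lines: int) -> torch.Tensor:
    kernel = conv.kernel_size[0]
    outputs = []
    # Each chunk needs (kernel - 1) extra input lines to produce its output lines.
    for start in range(0, frame.shape[2] - kernel + 1, chunk_lines):
        stop = min(start + chunk_lines + kernel - 1, frame.shape[2])
        outputs.append(conv(frame[:, :, start:stop, :]))   # only this slice is resident
    return torch.cat(outputs, dim=2)

if __name__ == "__main__":
    conv = nn.Conv2d(4, 8, kernel_size=3, padding=(0, 1))  # no padding along scan lines
    frame = torch.randn(1, 4, 64, 64)
    chunked = conv_by_chunks(frame, conv, chunk_lines=8)
    full = conv(frame)
    print(torch.allclose(chunked, full, atol=1e-5))         # True: same result, smaller buffer
```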

[0035] In some embodiments, a picture frame is processed by the convolutional neural network layers of U-Net 100 using a depth-first algorithm. For example, the picture frame may be processed using a first convolutional neural network layer for a few scan lines until there are enough scan lines to kick off a second convolutional neural network layer, which in turn accumulates enough scan lines to kick off a third convolutional neural network layer. If not, U-Net 100 may go back to process the picture frame using the first convolutional neural network layer and then the second convolutional neural network layer, accumulating enough scan lines to kick off the third convolutional neural network layer.

[0036] FIG. 5 illustrates a scheme of processing picture frames using U-Net 100, according to some embodiments of the present disclosure. As shown in FIG. 5, a first picture frame t (the current picture frame) is received from the input layer and processed using each of the convolutional neural network layers Conv1, Conv2, and Conv3 in encoder side 102 of U-Net 100. As opposed to processing the first picture frame t (e.g., the output of Conv4) using all convolutional neural network layers in encoder side 102 of U-Net 100, as shown in the schemes of FIGs. 1 and 4, the first picture frame t at at least one convolutional neural network layer Conv4 (e.g., the output of Conv3) may be stored into a memory (e.g., an internal memory (e.g., SRAM) of an SoC implementing the scheme of FIG. 5 or an external memory (e.g., DRAM) outside of the SoC). Instead, a second picture frame t-1 (a previous picture frame occurring immediately before the current picture frame) is retrieved from the memory (e.g., the internal memory of the SoC or the external memory outside of the SoC) and is processed using the at least one convolutional neural network layer Conv4 of U-Net 100. Since adjacent picture frames in a video are very similar most of the time, the second picture frame t-1 can replace the first picture frame t at the at least one convolutional neural network layer Conv4 without sacrificing the quality of U-Net 100 much. On the other hand, by replacing the first picture frame t with the second picture frame t-1 at the at least one convolutional neural network layer Conv4, the data dependency between the convolutional neural network layer Conv4 and the previous convolutional neural network layer Conv3 is cut off.

[0037] Moreover, in some embodiments, to further reduce the data dependency in U-Net 100, as opposed to processing the second picture frame t-1 (e.g., the output of Conv4) using the global pooling layer, the second picture frame t-1 at the global pooling layer (e.g., the output of Conv4) may be stored into a memory (e.g., the internal memory of the SoC or the external memory outside of the SoC). Instead, a third picture frame t-2 (another previous picture frame occurring immediately before the previous picture frame) is retrieved from the memory (e.g., the internal memory of the SoC or the external memory outside of the SoC) and is processed using the global pooling layer of U-Net 100. Since the global pooling layer of U-Net 100 is configured to process global information of the first, second, and third picture frames t, t-1, and t-2, such as lighting, average luminance, or global coloring of the environment, the third picture frame t-2 can replace the first picture frame t or the second picture frame t-1 at the global pooling layer without sacrificing the quality of U-Net 100.
On the other hand, by replacing the first picture frame t or the second picture frame t-1 with the third picture frame t-2 at the global pooling layer, the data dependency between the global pooling layer and the previous convolutional neural network layer Conv4 is also cut off. For example, as shown in FIG. 5, the data dependency between layers in encoder side 102 of U-Net 100 is illustrated as the number of scan lines that need to be kept during the process, such as 8 for Conv4, 4 for Conv3, 10 for Conv2, 22 for Conv1, and 46 for the input layer, which is further reduced compared with the scheme shown in FIG. 4.
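
The asymmetric three-frame arrangement of FIG. 5 can be sketched as below; the layer objects, channel counts, and attribute names are stand-ins assumed for illustration and are not the layers of U-Net 100.

```python
# Sketch of the FIG. 5 asymmetric scheme: frame t runs through Conv1..Conv3,
# the stored Conv3 output of frame t-1 feeds Conv4, and the stored Conv4
# output of frame t-2 feeds the global pooling layer.
import torch
import torch.nn as nn

class DelayedEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1_3 = nn.Conv2d(4, 8, 3, padding=1)   # stand-in for Conv1..Conv3 on frame t
        self.conv4 = nn.Conv2d(8, 16, 3, padding=1)    # processes the previous frame's features
        self.pool = nn.AdaptiveAvgPool2d(1)            # processes features delayed by two frames
        self.feat_t1 = None                            # Conv3 output stored for one frame
        self.feat_t2 = None                            # Conv4 output stored for one frame

    def forward(self, frame_t: torch.Tensor):
        feat3 = self.conv1_3(frame_t)
        feat4 = self.conv4(self.feat_t1 if self.feat_t1 is not None else feat3)
        ctx = self.pool(self.feat_t2 if self.feat_t2 is not None else feat4)
        self.feat_t1, self.feat_t2 = feat3.detach(), feat4.detach()
        return feat3, feat4, ctx                       # consumed by the decoder side

if __name__ == "__main__":
    enc = DelayedEncoder()
    for _ in range(3):                                 # frames t-2, t-1, and t
        feat3, feat4, ctx = enc(torch.randn(1, 4, 32, 32))
    print(feat3.shape, feat4.shape, ctx.shape)
```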

[0038] It is understood that in some examples, more previous picture frames may be used, for example, at convolutional neural network layers in encoder side 102 of U-Net 100 in a similar manner as shown in FIG. 5 to further cut the data dependency and reduce the required memory size for storing the data during the process. It is also understood that in some examples, the data dependency may not be cut off at the global pooling layer. It is further understood that the previous picture frame is not limited to the picture frame that occurs immediately before the current frame and may include any picture frame occurring before the current frame. Nevertheless, at least one previous picture frame occurring before the current picture frame may be processed using at least one of the global pooling layer or a convolutional neural network layer in encoder side 102 of U-Net 100 to cut the data dependency and reduce the required memory size for storing the data during the process.

[0039] As shown in FIG. 5, the first picture frame t is also processed using a first skip connection layer Skip1_Conv1 or Skip2_Conv1 of U-Net 100. The output of the first skip connection layer Skip1_Conv1 or Skip2_Conv1 and the output of a corresponding up-convolutional neural network layer UpConv4 or UpConv2 are concatenated using a corresponding concatenation node Concat3 or Concat2. Moreover, as shown in FIG. 5, the second picture frame t-1 is also processed using a second skip connection layer Skip3_Conv1 of U-Net 100. The output of the second skip connection layer Skip3_Conv1 and the output of the global pooling layer are concatenated using a corresponding concatenation node Concat1.

[0040] It is understood that since the data dependency affects encoder side 102, but not decoder side 104, of U-Net 100, the “cut” of U-Net 100 (e.g., processing multiple picture frames using different hidden layers) may be made “asymmetrically,” i.e., on encoder side 102, but not decoder side 104, of U-Net 100. In other words, FIG. 5 illustrates one example of “asymmetric cut” of U-Net 100 into three parts on encoder side 102 for processing three picture frames.

[0041] FIG. 6 illustrates a system 600 implementing the schemes of processing picture frames using machine learning, according to some embodiments of the present disclosure. System 600 may be a module of a user equipment or any other device, such as a mobile phone, a desktop computer, a laptop computer, a tablet, a vehicle computer, a gaming console, a printer, a positioning device, a wearable electronic device, a smart sensor, etc. System 600 may include a camera 610, an external memory 620, and an SoC 630, connected by a bus 640. Camera 610 may be a digital camera, such as a camera configured to provide RGGB sensor data. External memory 620 may be outside of SoC 630 and may include a DRAM memory. By contrast, SoC 630 may include a processor 632 and an internal memory 634, connected by an internal bus 636. Internal memory 634 may be an SRAM. Although internal bus 636 and bus 640 are shown as connecting the illustrated components, other connections are also possible. System 600 may transfer data from external memory 620 to processor 632 and/or internal memory 634. Internal memory 634 may be one or more orders of magnitude smaller than external memory 620.

[0042] System 600 may be configured to process picture frames using machine learning by implementing any suitable schemes disclosed herein. In some embodiments, processor 632 of SoC 630 is configured to process a first picture frame using a first convolutional neural network layer of a machine learning model (e.g., U-Net 100), and process a second picture frame using a second convolutional neural network layer or a pooling layer (e.g., a global pooling layer) of the machine learning model, as described above in detail with respect to FIGs. 4 and 5. In some embodiments, processor 632 of SoC 630 is configured to process the second picture frame using the second convolutional neural network layer, and process a third picture frame using the pooling layer. In some embodiments, processor 632 of SoC 630 is further configured to process the first picture frame using a first skip connection layer of the machine learning model, process the second picture frame using a second skip connection layer of the machine learning model, concatenate an output of the second skip connection layer and an output of the pooling layer, and concatenate an output of the first skip connection layer and an output of a corresponding up-convolutional neural network layer of the machine learning model.

[0043] In some embodiments, processor 632 of SoC 630 is further configured to process a chunk of the first picture frame using the first convolutional neural network layer. In some embodiments, internal memory 634 of SoC 630 is configured to store an output of the first convolutional neural network layer based on the chunk of the first picture frame. That is, by cutting the data dependency between the pooling layer and convolutional neural network layers or between different convolutional neural network layers, the chunk of data needed to be stored in internal memory 634 of SoC 630 can be reduced.

[0044] In some embodiments, internal memory 634 of SoC 630 is configured to store the second picture frame and/or the third picture frame, and processor 632 of SoC 630 is further configured to retrieve the second picture frame and/or the third picture frame from internal memory 634. In some embodiments, external memory 620 is configured to store the second picture frame and/or the third picture frame, and processor 632 of SoC 630 is further configured to retrieve the second picture frame and/or the third picture frame from external memory 620. That is, the previous picture frame(s) can be stored in either internal memory 634 of SoC 630 or external memory 620 outside of SoC 630 from which processor 632 of SoC 630 can retrieve when processing the current picture frame.

[0045] It is understood that SoC 630 may be implemented as any suitable chips of system 600, such as a neural processing unit (NPU, a.k.a., neural processor) for image enhancement, video editing, object detection and recognition, audio noise reduction, speech recognition, and NLP, etc. using machine learning.

[0046] FIG. 7 illustrates a flow chart of a method 700 for processing picture frames using machine learning, according to some embodiments of the present disclosure. FIG. 8 illustrates a flow chart of a detailed method 800 for processing picture frames using machine learning, according to some embodiments of the present disclosure. Examples of the apparatus that can perform operations of methods 700 and 800 include, for example, system 600 depicted in FIG. 6 or any other apparatus disclosed herein. It is understood that the operations shown in methods 700 and 800 are not exhaustive and that other operations can be performed as well before, after, or between any of the illustrated operations. Further, some of the operations may be performed simultaneously, or in a different order than shown in FIGs. 7 and 8. It is further understood that some operations may be skipped, such as operation 720 or 730 of method 700.

[0047] Referring to FIG. 7, method 700 starts at operation 710, in which a first picture frame is processed using a first convolutional neural network layer of a machine learning model. The machine learning model may include a U-Net, and the first convolutional neural network layer may be in an encoder side of the U-Net. Operation 710 may be performed by processor 632 of SoC 630. In some embodiments, at operation 802 of method 800, processor 632 may process a chunk of the first picture frame using the first convolutional neural network layer of U-Net 100. At operation 804, processor 632 may store an output of the first convolutional neural network layer based on the chunk of the first picture frame into internal memory 634 of SoC 630. The chunk of the first picture frame may include a number of scan lines of the first picture frame. The number of the scan lines may be determined based on a size of a kernel used by the first convolutional neural network layer. At operation 806, processor 632 may process the first picture frame using a first skip connection layer of U-Net 100.

[0048] Referring to FIG. 7, method 700 proceeds to operation 720 in which a second picture frame is processed using a second convolutional neural network layer of the machine learning model. The second picture frame may occur before the first picture frame, such as immediately before the first picture frame. The second convolutional neural network layer may be processed after the first convolutional neural network layer in the machine learning model. Operation 720 may be performed by processor 632 of SoC 630. In some embodiments, at operation 808 of method 800, processor 632 may retrieve the second picture frame from internal memory 634 of SoC 630 or from external memory 620. At operation 810, processor 632 may process the second picture frame using the second convolutional neural network layer of U-Net 100. At operation 812, processor 632 may process the second picture frame using a second skip connection layer of U-Net 100.

[0049] Referring to FIG. 7, method 700 proceeds to operation 730 in which a third picture frame is processed using a pooling layer of the machine learning model. The third picture frame may occur before the second picture frame. The pooling layer may be processed after the second convolutional neural network layer in the machine learning model. The pooling layer may be a global pooling layer configured to process global information of the first, second, and third picture frames. Operation 730 may be performed by processor 632 of SoC 630. In some embodiments, at operation 814 of method 800, processor 632 may retrieve the third picture frame from internal memory 634 of SoC 630 or from external memory 620. At operation 816, processor 632 may process the third picture frame using the pooling layer (e.g., a global pooling layer) of U-Net 100. At operation 818, processor 632 may concatenate an output of the second skip connection layer and an output of the pooling layer (e.g., the global pooling layer). At operation 820, processor 632 may concatenate an output of the first skip connection layer and an output of a corresponding up-convolutional neural network layer of U-Net 100.

[0050] In various aspects of the present disclosure, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or encoded as instructions or code on a non-transitory computer-readable medium. Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computing device, such as processor 632 in FIG. 6. By way of example, and not limitation, such computer-readable media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, hard disk drive (HDD), such as magnetic disk storage or other magnetic storage devices, Flash drive, SSD, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a processing system, such as a mobile device or a computer. Disk and disc, as used herein, includes CD, laser disc, optical disc, digital versatile disk (DVD), and floppy disk where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

[0051] According to one aspect of the present disclosure, an apparatus for processing picture frames using machine learning includes an SoC. The SoC includes a processor configured to process a first picture frame using a first convolutional neural network layer of a machine learning model, and process a second picture frame using a second convolutional neural network layer or a pooling layer of the machine learning model. The second picture frame occurs before the first picture frame. The second convolutional neural network layer is processed after the first convolutional neural network layer in the machine learning model.

[0052] In some embodiments, the processor is configured to process the second picture frame using the second convolutional neural network layer, and process a third picture frame using the pooling layer. In some embodiments, the pooling layer is processed after the second convolutional neural network layer in the machine learning model. In some embodiments, the third picture frame occurs before the second picture frame.

[0053] In some embodiments, the second picture frame occurs immediately before the first picture frame, and the third picture frame occurs immediately before the second picture frame.

[0054] In some embodiments, the pooling layer is a global pooling layer configured to process global information of the first, second, and third picture frames.

[0055] In some embodiments, the processor is further configured to process the first picture frame using a first skip connection layer of the machine learning model, process the second picture frame using a second skip connection layer of the machine learning model, concatenate an output of the second skip connection layer and an output of the pooling layer, and concatenate an output of the first skip connection layer and an output of a corresponding up-convolutional neural network layer of the machine learning model.

[0056] In some embodiments, the machine learning model includes a U-Net, and the first convolutional neural network layer, the second convolutional neural network layer, and the pooling layer are in an encoder side of the U-Net.

[0057] In some embodiments, to process the first picture frame, the processor is further configured to process a chunk of the first picture frame using the first convolutional neural network layer. In some embodiments, the SoC further includes an internal memory configured to store an output of the first convolutional neural network layer based on the chunk of the first picture frame.

[0058] In some embodiments, the chunk of the first picture frame includes a number of scan lines of the first picture frame, and the number of the scan lines is determined based on a size of a kernel used by the first convolutional neural network layer.

[0059] In some embodiments, the internal memory is further configured to store the second picture frame. In some embodiments, the processor is further configured to retrieve the second picture frame from the internal memory.

[0060] In some embodiments, the apparatus further includes an external memory configured to store the second picture frame. In some embodiments, the processor is further configured to retrieve the second picture frame from the external memory.

[0061] According to another aspect of the present disclosure, a method for processing picture frames using machine learning is disclosed. A first picture frame is processed by a processor of a SoC using a first convolutional neural network layer of a machine learning model. A second picture frame is processed by the processor using a second convolutional neural network layer or a pooling layer of the machine learning model. The second picture frame occurs before the first picture frame. The second convolutional neural network layer is processed after the first convolutional neural network layer in the machine learning model.

[0062] In some embodiments, the second picture frame is processed by the processor using the second convolutional neural network layer, and a third picture frame is processed by the processor using the pooling layer. In some embodiments, the pooling layer is processed after the second convolutional neural network layer in the machine learning model. In some embodiments, the third picture frame occurs before the second picture frame.

[0063] In some embodiments, the second picture frame occurs immediately before the first picture frame, and the third picture frame occurs immediately before the second picture frame.

[0064] In some embodiments, the pooling layer is a global pooling layer configured to process global information of the first, second, and third picture frames.

[0065] In some embodiments, the first picture frame is processed by the processor using a first skip connection layer of the machine learning model, the second picture frame is processed by the processor using a second skip connection layer of the machine learning model, an output of the second skip connection layer and an output of the pooling layer are concatenated by the processor, and an output of the first skip connection layer and an output of a corresponding up-convolutional neural network layer of the machine learning model are concatenated by the processor.

[0066] In some embodiments, the machine learning model includes a U-Net, and the first convolutional neural network layer, the second convolutional neural network layer, and the pooling layer are in an encoder side of the U-Net.

[0067] In some embodiments, to process the first picture frame, a chunk of the first picture frame is processed using the first convolutional neural network layer. In some embodiments, an output of the first convolutional neural network layer is stored by an internal memory of the SoC based on the chunk of the first picture frame.

[0068] In some embodiments, the chunk of the first picture frame includes a number of scan lines of the first picture frame, and the number of the scan lines is determined based on a size of a kernel used by the first convolutional neural network layer.

[0069] According to yet another aspect of the present disclosure, a non-transitory computer-readable medium is encoded with instructions that, when executed by a processor, cause the processor to process a first picture frame using a first convolutional neural network layer of a machine learning model, and process a second picture frame using a second convolutional neural network layer and a pooling layer of the machine learning model. The second picture frame occurs before the first picture frame. The second convolutional neural network layer is processed after the first convolutional neural network layer in the machine learning model. The pooling layer is processed after the second convolutional neural network layer in the machine learning model.

[0070] In some embodiments, the instructions, when executed by the processor, further cause the processor to process the second picture frame using the second convolutional neural network layer, and process a third picture frame using the pooling layer. In some embodiments, the third picture frame occurs before the second picture frame.

[0071] The foregoing description of the specific embodiments will so reveal the general nature of the present disclosure that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present disclosure. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.

[0072] Embodiments of the present disclosure have been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.

[0073] The Summary and Abstract sections may set forth one or more but not all exemplary embodiments of the present disclosure as contemplated by the inventor(s), and thus, are not intended to limit the present disclosure and the appended claims in any way.

[0074] Various functional blocks, modules, and steps are disclosed above. The particular arrangements provided are illustrative and without limitation. Accordingly, the functional blocks, modules, and steps may be re-ordered or combined in different ways than in the examples provided above. Likewise, some embodiments include only a subset of the functional blocks, modules, and steps, and any such subset is permitted.

[0075] The breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.