


Title:
REAL-TIME VIDEO SUPER-RESOLUTION FOR MOBILE DEVICES
Document Type and Number:
WIPO Patent Application WO/2023/229644
Kind Code:
A1
Abstract:
This application is directed to video super-resolution. An electronic device obtains a prior input image and a current input image that follows the prior input image in a sequence of image frames having a first resolution. A residual block based network is applied to generate a prior output feature based on the prior input image. The current input image and the prior output feature are combined to generate a current input feature. The residual block based network is applied to generate a current output feature based on the current input feature, and the current output feature is converted to a current output image having a second resolution, the second resolution greater than the first resolution.

Inventors:
CAI JIE (US)
MENG ZIBO (US)
DING JIAMING (US)
HO CHIU (US)
Application Number:
PCT/US2022/053987
Publication Date:
November 30, 2023
Filing Date:
December 23, 2022
Assignee:
INNOPEAK TECH INC (US)
International Classes:
G06T3/40; G06N3/02
Foreign References:
US20220101497A12022-03-31
US20200372609A12020-11-26
US20180338159A12018-11-22
Attorney, Agent or Firm:
WANG, Jianbai et al. (US)
Claims:
What is claimed is:

1. An image processing method, implemented at an electronic device including one or more processors and memory, comprising: obtaining a prior input image and a current input image that follows the prior input image in a sequence of image frames having a first resolution; applying a residual block based network to generate a prior output feature based on the prior input image; combining the current input image and the prior output feature to generate a current input feature; applying the residual block based network to generate a current output feature based on the current input feature; and converting the current output feature to a current output image having a second resolution, the second resolution greater than the first resolution.

2. The method of claim 1, combining the current input image and the prior output feature further comprising: converting the first resolution of the current input image based on a resolution of the prior output feature generated based on the prior input image; concatenating the current input image and the prior output feature to generate a concatenated input image; and extracting the current input feature from the concatenated input image.

3. The method of claim 1 or 2, wherein the residual block based network includes an input interface, an output interface, and a plurality of distinct residual blocks that are coupled in series between the input and output interfaces, and the input interface is coupled to the output interface via a skip connection.

4. The method of claim 1 or 2, wherein the residual block based network includes an input interface, an output interface, and a plurality of identical residual block groups that are coupled in series between the input and output interfaces, and the input interface is coupled to the output interface via a skip connection, each identical residual block group including a plurality of distinct residual blocks that are coupled in series.

5. The method of claim 1 or 2, wherein the residual block based network includes an input interface, an output interface, and a plurality of residual block groups that are coupled in series between the input and output interfaces, and the input interface is coupled to the output interface via a skip connection, each residual block group made of a plurality of distinct residual blocks that are coupled in series according to a respective distinct order.

6. The method of any of claims 1-5, wherein the sequence of image frames is started with the prior input image, the method further comprising: creating an initial feature map, all elements of the initial feature map equal to 0; combining the prior input image and the initial feature map to generate a prior input feature; and generating the prior output feature from the prior input feature using the residual block based network.

7. The method of any of claims 1-5, the prior input image including a first input image, the prior output feature including a first prior output feature, the method further comprising: obtaining a second input image, the first input image following the second input image; and applying the residual block based network to generate a second prior output feature based on the second input image, wherein the current input image is combined with both the first and second prior output features to generate the current input feature.

8. The method of any of claims 1-7, wherein the current input image immediately follows the prior input image in the sequence of image frames.

9. The method of any of claims 1-8, wherein each image of the sequence of image frames includes a respective RGB color image, the method further comprising, before obtaining the prior and current input images: obtaining a sequence of raw images including a prior raw image and a current raw image, the sequence of raw images captured by an image sensor array; and performing, by an image signal processor (ISP), image processing operations on the prior and current raw images to generate the prior and current input images, respectively.

10. The method of any of claims 1-8, wherein each image of the sequence of image frames includes a respective raw image captured by an image sensor array, the method further comprising: after converting the current output feature to the current output image, performing, by an ISP, image processing operations on the current output image to generate a current RGB color image.

11. The method of any of claims 1-10, further comprising: obtaining a next input image that follows the current input image in the sequence of image frames; combining the next input image and the current output feature to generate a next input feature; applying the residual block based network to generate a next output feature based on the next input feature; and converting the next output feature to a next output image having the second resolution.

12. The method of any of claims 1-11, wherein: the residual block based network has a plurality of layers and includes a plurality of weights associated with a respective number of filters of each layer; and the plurality of weights are quantized in an int8, uint8, int16, or uint16 format based on a precision setting of an electronic device.

13. An electronic device, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-12.

14. A non-transitory computer-readable storage medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform a method of any of claims 1-12.

Description:
Real-Time Video Super-Resolution for Mobile Devices

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application is a continuation and claims priority to International Patent Application No. PCT/US2022/030918, titled “Real-Time Video Super-Resolution for Mobile Devices,” filed May 25, 2022, International Patent Application No. PCT/US2022/030919, titled “Deep Learning Based Video Super-Resolution,” filed May 25, 2022, and International Patent Application No. PCT/US2022/030924, titled “Real Scene Super-Resolution with Raw Images for Mobile Devices,” filed May 25, 2022. The contents of each of the above applications are herein incorporated by reference in its entirety.

TECHNICAL FIELD

[0002] This application relates generally to image processing technology including, but not limited to, methods, systems, and non-transitory computer-readable storage media for generating high resolution visual data from low resolution visual data to restore visual details during image super-resolution (ISR) or video super-resolution (VSR).

BACKGROUND

[0003] Deep learning solutions have been applied in image or video super-resolution. For example, a super-resolution convolutional neural network (SRCNN) includes three layers and is used in ISR. As another example, a super-resolution generative adversarial network (SRGAN) uses an adversarial loss to implement ISR. Residual-in-residual dense blocks (RRDB) without batch normalization are employed as the basic network building unit, and a relativistic generative adversarial network (GAN) is used to train a generator. These deep learning techniques have enhanced speed, restored video information in VSR, and achieved appealing improvements by reusing some existing components (e.g., propagation, alignment, aggregation, and upsampling) with minimal redesigns. In another example, a pyramid, cascading, and deformable (PCD) alignment module aligns frames at a feature level using deformable convolutions in a coarse-to-fine manner. A temporal and spatial attention (TSA) fusion module is applied to emphasize important features for subsequent restoration in both the temporal and spatial domains. Spatial and temporal contexts are optionally integrated from continuous video frames using a recurrent encoder-decoder module. In particular, an end-to-end trainable frame-recurrent framework may be applied to warp a previously inferred high resolution frame to estimate and super-resolve a subsequent frame.

[0004] Existing deep learning solutions focus on achieving the highest fidelity scores and are not optimized for computational efficiency in terms of the number of parameters and floating-point operations per second (FLOPS). The aforementioned solutions contain millions of parameters, require 600G to 6000G FLOPS, and take several seconds to infer a single low resolution frame using a graphics processing unit (GPU). These deep learning solutions cannot be deployed on mobile devices having limited computational resources. Additionally, existing deep learning solutions do not consider complicated noise schemes in ISR and VSR thoroughly and accurately, and inevitably compromise the accuracy of the associated ISR and VSR. It would be beneficial to have an effective, efficient, and accurate mechanism to implement ISR and VSR at an electronic device, particularly at a mobile device having limited computational resources.

SUMMARY

[0005] Various embodiments of this application are directed to generating high resolution visual data from low resolution visual data during ISR and VSR efficiently and accurately. In some embodiments, a shuffled recursive residual network (SRRN) is applied to implement efficient ISR and VSR. The SRRN relies on recursive learning, which controls the number of model parameters while increasing depth, and on a random shuffle technique used to increase the network's generalization ability. In some embodiments, raw images captured by a camera are used to restore high-resolution clear images. More information can be exploited in the raw domain, because raw images are typically recorded in a 10-bit or 12-bit format, whereas color images (e.g., RGB or YUV images produced by an image signal processor (ISP)) are typically stored in an 8-bit format. Also, the raw images have not been processed by an ISP that applies nonlinear processing (e.g., tone mapping, Gamma correction, blurring, noise filtering) in an RGB or YUV space, and therefore, are not impacted by the difficulties that such nonlinear processing causes for image or video restoration in the RGB/YUV domain. ISR or VSR of the raw images is configured to perform real-time inference efficiently on mobile devices.
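The following is a minimal sketch, not taken from the application, that only illustrates the bit-depth point above: a 10-bit raw sample spans 1024 levels while an 8-bit color sample spans 256, so the raw domain retains finer tonal information before nonlinear ISP processing. Array shapes and names are hypothetical.

```python
import numpy as np

# Hypothetical raw (10-bit) and ISP-produced RGB (8-bit) frames of the same scene.
raw_10bit = np.random.randint(0, 1024, size=(1080, 1920), dtype=np.uint16)
rgb_8bit = np.random.randint(0, 256, size=(1080, 1920, 3), dtype=np.uint8)

# Normalize both to [0, 1] before feeding a super-resolution network;
# the raw frame keeps roughly 4x finer quantization steps than the RGB frame.
raw_norm = raw_10bit.astype(np.float32) / 1023.0
rgb_norm = rgb_8bit.astype(np.float32) / 255.0
print(raw_norm.shape, rgb_norm.shape)
```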

[0006] Additionally, in some embodiments, an end-to-end trainable frame-recurrent VSR framework is applied, e.g., in the raw domain. This framework utilizes a previously inferred high resolution frame to super-resolve a subsequent low resolution frame. This naturally encourages temporally consistent results and reduces a computational cost by warping an image in each step. Furthermore, this frame-recurrent VSR framework has the ability to assimilate a large number of previous frames without increasing computational demands. By these means, some implementations of this application provide efficient and effective deep learning solutions that can be deployed on edge devices (e.g., mobile devices) to enhance runtime, parameter size, FLOPs, activations, and memory consumption of ISR and VSR.

[0007] In one aspect, an image processing method is implemented at an electronic device including one or more processors and memory. The method includes obtaining an input image having a first resolution and extracting an image feature map from the input image. The method further includes processing the image feature map with an SRRN to generate an output feature map. The SRRN includes a first residual block group and a second residual block group coupled to the first residual block group, and each of the first and second residual block groups is made from all of a plurality of residual blocks. The method further includes converting the output feature map to an output image having a second resolution that is greater than the first resolution. The plurality of residual blocks are coupled in series according to a first order and a second order to form the first residual block group and the second residual block group, respectively. The second order is distinct from the first order.
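The sketch below is one hedged reading of this aspect, not the claimed implementation: a single shared pool of residual blocks is reused (recursive learning), and each residual block group applies the same blocks in a different order. The channel width, block count, and the two orders shown are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # local skip connection inside the block

class ShuffledRecursiveNet(nn.Module):
    def __init__(self, channels=32, num_blocks=4):
        super().__init__()
        # One shared set of residual blocks, reused by every group (recursive learning).
        self.blocks = nn.ModuleList(ResidualBlock(channels) for _ in range(num_blocks))
        self.orders = [[0, 1, 2, 3], [2, 0, 3, 1]]  # a first order and a distinct second order

    def forward(self, feat):
        x = feat
        for order in self.orders:          # each order forms one residual block group
            group_in = x
            for idx in order:
                x = self.blocks[idx](x)
            x = x + group_in               # skip connection around the group
        return x + feat                    # skip connection from input to output interface
```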

[0008] In another aspect, an image processing method is implemented at an electronic device including one or more processors and memory. The method includes obtaining a prior input image and a current input image that follows the prior input image in a sequence of image frames having a first resolution. The method further includes applying a residual block based network to generate a prior output feature based on the prior input image. The method further includes combining the current input image and the prior output feature to generate a current input feature, applying the residual block based network to generate a current output feature based on the current input feature, and converting the current output feature to a current output image having a second resolution. The second resolution is greater than the first resolution.
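A minimal sketch of this recurrent step, under assumed names, channel counts, and a x3 scale (none of which are specified here): the prior output feature is concatenated with the current low-resolution frame, a residual backbone stands in for the residual block based network, and a pixel shuffle converts the current output feature to the second resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentSRStep(nn.Module):
    def __init__(self, img_ch=3, feat_ch=32, scale=3):
        super().__init__()
        self.fuse = nn.Conv2d(img_ch + feat_ch, feat_ch, 3, padding=1)  # combine image and prior feature
        self.backbone = nn.Sequential(                                   # stand-in for the residual block based network
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
        )
        self.to_image = nn.Conv2d(feat_ch, img_ch * scale * scale, 3, padding=1)
        self.scale = scale

    def forward(self, current_lr, prior_feat):
        x = F.relu(self.fuse(torch.cat([current_lr, prior_feat], dim=1)))
        feat = x + self.backbone(x)                       # current output feature, reused for the next frame
        sr = F.pixel_shuffle(self.to_image(feat), self.scale)
        return sr, feat

# For the first frame of the sequence, prior_feat may be an all-zero feature map (compare claim 6).
```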

[0009] In yet another aspect, an image processing method is implemented at an electronic device including one or more processors and memory. The method includes obtaining a prior input image and a current input image that follows the prior input image in a sequence of image frames having a first resolution and applying a residual block based network to generate a prior output image based on the prior input image. The prior output image has a second resolution greater than the first resolution. The method further includes predicting a first output image from the prior output image and a current optical flow map, and the current optical flow map describes image motion between the prior and current input images. The method further includes combining the current input image and the first output image to generate a combined input image, applying the residual block based network to generate a current output feature based on the combined input image, and converting the current output feature to a current output image having the second resolution.
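For the flow-guided variant, the key operation is warping the prior high-resolution output by the current optical flow map before combining it with the current input. The helper below is an assumed sketch (names, shapes, and the pixel-displacement flow convention are not from the application) using a standard bilinear sampling routine.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(prior_hr, flow):
    """prior_hr: (N, C, H, W) prior output image; flow: (N, 2, H, W) displacements (dx, dy) in pixels."""
    n, _, h, w = prior_hr.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(prior_hr)  # (1, 2, H, W) base coordinates
    new_pos = grid + flow                                                   # where each output pixel samples from
    # Normalize sampling positions to [-1, 1] as required by grid_sample.
    new_pos[:, 0] = 2.0 * new_pos[:, 0] / (w - 1) - 1.0
    new_pos[:, 1] = 2.0 * new_pos[:, 1] / (h - 1) - 1.0
    return F.grid_sample(prior_hr, new_pos.permute(0, 2, 3, 1), align_corners=True)
```

The warped result plays the role of the first output image above; it is then combined with the current input image before the residual block based network is applied.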

[0010] In another aspect, some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.

[0011] In yet another aspect, some implementations include a non-transitory computer-readable storage medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.

[0012] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof.

Additional embodiments are discussed in the Detailed Description, and further description is provided there.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.

[0014] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.

[0015] Figure 2 is a block diagram illustrating an electronic system configured to process content data (e.g., image data), in accordance with some embodiments.

[0016] Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.

[0017] Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments, and Figure 4B is an example node in the neural network, in accordance with some embodiments.

[0018] Figure 5 is a flow diagram of an example image processing method for increasing an image resolution (i.e., for image or video super-resolution), in accordance with some embodiments.

[0019] Figures 6A and 6B are two flow diagrams of example image processing methods for increasing an image resolution (i.e., image or video super-resolution), in accordance with some embodiments.

[0020] Figure 7 is a flow diagram of an image processing process that increases an image resolution for ISR or VSR using an example SRRN, in accordance with some embodiments.

[0021] Figures 8A-8C are block diagrams of three example SRRNs having three distinct shuffling schemes of residual blocks RB1-RB4, in accordance with some embodiments.

[0022] Figure 9 is a block diagram of an example neural network for determining distinct orders of residual blocks in residual block groups of an SRRN, in accordance with some embodiments.

[0023] Figure 10 is a flow diagram of an image processing process that increases an image resolution for ISR or VSR using a residual block based network, in accordance with some embodiments.

[0024] Figure 11A is a block diagram of a residual block, in accordance with some embodiments. Figure 11B is a block diagram of an example residual block based network including a sequence of residual blocks, in accordance with some embodiments. Figure 11C is a block diagram of another example residual block based network including a sequence of identical residual block groups, in accordance with some embodiments. Figure 11D is a block diagram of another example residual block based network including an SRRN, in accordance with some embodiments.

[0025] Figure 12 is a flow diagram of an image processing process that increases an image resolution of a prior input image for ISR or VSR using a residual block based network, in accordance with some embodiments.

[0026] Figure 13 is a flow diagram of an image processing process that increases an image resolution of a next input image for ISR or VSR using a residual block based network, in accordance with some embodiments.

[0027] Figure 14 is a flow diagram of another example image processing process that increases an image resolution for ISR or VSR using a residual block based network, in accordance with some embodiments.

[0028] Figure 15 is a flow diagram of another example image processing process that increases an image resolution based on an optical flow map, in accordance with some embodiments.

[0029] Figure 16 is a flow diagram of another example image processing process that increases an image resolution for ISR or VSR using a residual block based network, in accordance with some embodiments.

[0030] Figure 17 is a flow diagram of an example image processing method for improving image quality using an SRRN, in accordance with some embodiments.

[0031] Figure 18 is a flow diagram of an example image processing method for improving image quality using an SRRN, in accordance with some embodiments.

[0032] Figure 19 is a flow diagram of an example image processing method for improving image quality using an optical flow map, in accordance with some embodiments.

[0033] Like reference numerals refer to corresponding parts throughout the several views of the drawings.

DETAILED DESCRIPTION

[0034] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.

[0035] Image or video super-resolution aims at recovering a high resolution (HR) image or video from a corresponding low-resolution (LR) image or video. As high definition (HD) and ultra-high definition (UHD) display devices become widely used, super-resolution attracts attention and becomes a critical function of many media-related user applications. In some embodiments, a shuffled recursive residual network (SRRN) is applied to implement efficient ISR and VSR. Recursive learning controls the number of model parameters while increasing depth, and random shuffling increases the network's generalization ability. In some embodiments, raw images captured by a camera are used to restore high-resolution clear images, because more visual information can be exploited directly in the raw domain without being compromised by the linear or nonlinear processing implemented by the ISP of the camera. Additionally, in some embodiments, a frame-recurrent VSR framework is applied in the raw domain, and uses a previously inferred high resolution frame to super-resolve a subsequent low resolution frame. By these means, some implementations of this application provide efficient and effective deep learning solutions that are based on recursive blocks and can enable VSR in real time (e.g., at a rate of 30 frames per second (FPS)) on mobile devices having limited power, computational, and storage resources.

[0036] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted displays (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.

[0037] The one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be, or are preferably not, completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C. The networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and shares information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E remotely and in real time.

[0038] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages.

[0039] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. In these deep learning techniques, data processing models (e.g., a residual block based network 1010 or 1410, an optical flow network 1416) are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequently to model training, the mobile phone 104C or HMD 104D obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.

[0040] In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D). The client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A and HMD 104D). The server 102A obtains the training data from itself, another server 102, or the storage 106, and applies the training data to train the data processing models. The client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results. The client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104. The server 102B obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.

[0041] In some embodiments, a pair of AR glasses 104D (also called an HMD) are communicatively coupled in the data processing environment 100. The AR glasses 104D includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the AR glasses 104D, and recognizes the hand gestures locally and in real time using a two-stage hand gesture recognition model. In some situations, the microphone records ambient sound, including user’s voice commands. In some situations, both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses. Optionally, deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses. The device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D. In some embodiments, the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.

[0042] As explained above, in some embodiments, deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by a client device 104. 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model. Visual content is optionally generated using a second data processing model. When the client device 104 has a limited computational capability, training of the first or second data processing models is optionally implemented by the server 102, while inference of the device poses and visual content is implemented by the client device 104. In an example, the second data processing model includes an image processing model for ISR or VSR, and is implemented in a user application (e.g., a social networking application, a social media application, a short video application, and a media play application).

[0043] Figure 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments. The electronic system 200 includes a server 102, a client device 104 (e.g., a mobile phone 104C in Figure 1), a storage 106, or a combination thereof. The electronic system 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera for gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more optical cameras (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.

Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:

• Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;

• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;

• User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);

• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;

• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;

• One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);

• Model training module 226 for obtaining training data and establishing a data processing model 240 for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;

• Data processing module 230 for processing content data using data processing models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 230 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224, and in an example, the data processing module 230 is applied to increase an image resolution using an image processing model (e.g., model 515 in Figure 5, 615 in Figure 6A, model 706 in Figure 7, model 1022 in Figure 10, and model 1418 in Figure 14); and

• One or more databases 250 for storing at least data including one or more of:

o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;

o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;

o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server, and host name;

o Training data 238 for training one or more data processing models 240;

o Data processing model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where the data processing models 240 include an image processing model (e.g., model 515 in Figure 5, model 615 in Figure 6A, model 706 in Figure 7, model 1022 in Figure 10, and model 1418 in Figure 14) configured to implement ISR/VSR; and

o Content data and results 242 that are obtained by and outputted to the client device 104 of the electronic system 200, respectively, where the content data is processed by the data processing models 240 locally at the client device 104 to provide the associated results to be presented on the client device 104.

[0045] Optionally, the one or more databases 250 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200. Optionally, the one or more databases 250 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.

[0046] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.

[0047] Figure 3 is an example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 230 for processing the content data using the data processing model 240. In some embodiments, both of the model training module 226 and the data processing module 230 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 238 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, both of the model training module 226 and the data processing module 230 are located on a server 102 of the data processing system 300. The training data source 304 providing the training data 238 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 230 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.

[0048] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 240 is trained according to a type of the content data to be processed. The training data 238 is consistent with the type of the content data, and so is the data pre-processing module 308 applied to process the training data 238. For example, an image pre-processing module 308A is configured to process image training data 238 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 238 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 240 is provided to the data processing module 230 to process the content data.
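A generic sketch of the train-and-monitor loop described in this paragraph, for the supervised case: each pre-processed training item is forwarded through the model, its output is compared with the ground truth, and the weights are adjusted until the loss criterion is satisfied. The L1 loss, the optimizer, and all names below are assumptions rather than details from the application.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, loss_threshold=1e-3, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.L1Loss()                      # compares output with ground truth (loss control)
    for _ in range(epochs):
        for lr_img, hr_img in loader:            # pre-processed training item and its ground truth
            optimizer.zero_grad()
            loss = criterion(model(lr_img), hr_img)
            loss.backward()                      # backward propagation
            optimizer.step()                     # modify the model to reduce the loss function
            if loss.item() < loss_threshold:     # loss criterion satisfied
                return model
    return model
```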

[0049] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.

[0050] The data processing module 230 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.

[0051] Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments, and Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments. The data processing model 240 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs. As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. In an example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
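A tiny numeric sketch of the node behavior just described: the node output applies a non-linear activation to the weighted combination of the node inputs (a bias term b, per paragraph [0055], is added in some embodiments). The input values, weights, and the choice of ReLU are purely illustrative.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0, 0.1])    # node inputs received over links 412
w = np.array([0.2, 0.4, -0.1, 0.3])    # weights w1..w4
b = 0.05                                # network bias term (optional)

z = np.dot(w, x) + b                    # linear weighted combination of the node inputs
out = max(0.0, z)                       # ReLU as one possible non-linear activation
print(z, out)
```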

[0052] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the one or more layers include a single layer acting as both an input layer and an output layer. Optionally, the one or more layers include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down-sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.

[0053] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 240 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolutional layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, imprinting, or synthesis.

[0054] Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. In an example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently RNN (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 230, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.

[0055] The training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set which is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data. The result of the training includes the network bias parameter b for each layer.
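The brief sketch below illustrates the CNN abstraction in paragraph [0053]: a convolutional layer turns a pre-processed image tensor into a feature map whose output locations depend only on a small receptive area, and pooling down-samples that map. All shapes and the 3x3 kernel choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

image = torch.rand(1, 3, 64, 64)                    # pre-processed RGB image in the predefined format
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)   # each output node sees a 3x3 receptive area
feature_map = torch.relu(conv(image))               # (1, 16, 64, 64) feature map abstraction
pooled = nn.MaxPool2d(2)(feature_map)               # max pooling halves the spatial resolution
print(feature_map.shape, pooled.shape)
```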

[0056] Figure 5 is a flow diagram of an example image processing method 500 for increasing an image resolution (i.e., for image or video super-resolution), in accordance with some embodiments. The image processing method 500 is implemented by an electronic device (e.g., a data processing module 230 of a mobile phone 104C). The electronic device obtains an input image 502 having a first resolution (e.g., HxW) and generates an output image 504 having a third resolution (e.g., nHxnW, where n is a positive integer) that is greater than the first resolution. The input image 502 is optionally a static image or an image frame of a video clip. In some situations, the input image 502 is received via one or more communication networks 108 and in a user application 224, e.g., a social networking application, a social media application, a short video application, and a media play application. Examples of this user application 224 include, but are not limited to, Tiktok, Kuaishou, WeChat, Tencent video, iQiyi, and Youku. Given a limited signal transmission bandwidth, a server 102 associated with the user application 224 streams low-resolution visual data including the input image 502 to electronic devices distributed at different client nodes. If displayed without ISR, the input image 502 would result in a poor user experience for users of the user application 224. In an example, the input image 502 is part of a low-resolution video stream provided to unpaid users of a media play application. VSR aims to improve video quality and the users' watching experience by utilizing artificial intelligence. As such, the image processing method 500 uses low-resolution information of the input image 502 and associated adjacent temporal information to predict missing information of the input image 502, which leads to a high-resolution video sequence including the output image 504. In some embodiments, the image processing method 500 also enhances a quality of the input image 502, e.g., by reducing noise, blurriness, and artifacts therein.

[0057] The input image 502 includes a plurality of image components (e.g., three components 502A, 502B, and 502C). For example, the three image components 502A, 502B, and 502C correspond to a luminance component (Y) and two chrominance components (U and V) of the input image, respectively. The electronic device separates an image component 502A (e.g., the luminance component (Y)) from one or more remaining components 502B and 502C of the input image 502. The image component 502A has the first resolution (e.g., HxW). Optionally, the image component 502A has a single channel. The electronic device extracts an image feature map 506 from the image component 502A. The image feature map 506 has a second resolution (e.g., H/m x W/m, where m is a positive integer) that is equal to or less than the first resolution. In an example, the image feature map 506 is expanded to a plurality of channels, e.g., 9 or 32 channels. The image feature map 506 is further processed by a plurality of successive recursive blocks 508 (e.g., two successive recursive blocks 508A and 508B that are coupled in series) to generate an output feature map 510. Each recursive block 508 includes a plurality of residual units 512 and a skip connection 514 coupling an input of the recursive block 508 to an output of the recursive block 508. The electronic device converts the output feature map 510 to an output component 504A having a third resolution (e.g., 3Hx3W) that is greater than the first resolution.

[0058] The output component 504A and the one or more remaining components 502B and 502C of the input image 502 are combined to generate an output image 504 having the third resolution. Each of the one or more remaining components 502B and 502C has the first resolution. The third resolution is equal to a multiplication of the first resolution by a scale number. In some embodiments, each pixel in the one or more remaining components 502B and 502C corresponds to a pixel group having the scale number of pixels and including the respective pixel itself. A component value corresponding to each pixel is spread to cover the entire pixel group. For example, the first and third resolutions are HxW and 3Hx3W, respectively. Each pixel in the component 502B or 502C corresponds to 9 respective immediately adjacent pixels in a counterpart component 504B or 504C, respectively. The component value of each pixel in the component 502B or 502C is therefore used as the component values of the 9 pixels in the counterpart component 504B or 504C, respectively. Alternatively, in some embodiments, the 9 pixels in each counterpart component 504B or 504C are organized in a 3x3 pixel array. The component value of each pixel in the component 502B or 502C is therefore used as a component value of a center pixel of the 9 pixels in the counterpart component 504B or 504C, respectively. Other pixels in the 9 pixels are interpolated from the two closest center pixels of two pixel groups in the counterpart component 504B or 504C based on relative distances to the two closest center pixels. After the counterpart components 504B and 504C having the third resolution are determined, the output component 504A and the counterpart components 504B and 504C are combined to generate the output image 504 having the third resolution.
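A hedged sketch of the first option in this paragraph: each chrominance value is spread over its 3x3 pixel group by nearest-neighbor replication and then recombined with the super-resolved luminance component. The tensor shapes and the x3 scale are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

y_sr = torch.rand(1, 1, 3 * 120, 3 * 160)   # output component 504A at 3H x 3W (here H=120, W=160)
u_lr = torch.rand(1, 1, 120, 160)            # remaining component 502B at H x W
v_lr = torch.rand(1, 1, 120, 160)            # remaining component 502C at H x W

# Spread each component value over its entire 3x3 pixel group (nearest-neighbor replication).
u_sr = F.interpolate(u_lr, scale_factor=3, mode="nearest")
v_sr = F.interpolate(v_lr, scale_factor=3, mode="nearest")
# A bilinear mode would roughly approximate the alternative center-pixel interpolation scheme.
output_yuv = torch.cat([y_sr, u_sr, v_sr], dim=1)   # output image 504 at the third resolution
```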

[0059] The image processing process 500 is implemented based on an image processing model 515 that includes a feature extraction model 516 and an output conversion module 518 in addition to the plurality of recursive blocks 508. The feature extraction model 516 is configured to extract the image feature map 506 from the image component 502A. In an example, a 3x3 convolution layer is followed by one rectified linear unit (ReLU) layer to extract shallow features represented in a 9-channel input feature map 520. Another 3x3 convolution layer followed by one ReLU layer is optionally applied to extract additional features represented in the image feature map 506, e.g., having 32 channels. The output conversion module 518 is coupled to an output of the plurality of recursive blocks 508 and configured to convert the output feature map 510 to the output component 504A. In the output conversion module 518, a 3x3 convolution layer is followed by one ReLU layer to convert the output feature map 510 having 32 channels to a 9-channel intermediate feature map 522. The input feature map 520 and the intermediate feature map 522 are combined on an element-by-element basis and processed by a depth space model 524 (also called a pixel shuffle layer) to generate the output component 504A of the output image 504.
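A minimal PyTorch sketch of the feature extraction and output conversion stages described above may read as follows. The channel counts (9 and 32) follow the example in this paragraph; the class names and exact layer arrangement are illustrative assumptions rather than the application's implementation.

```python
import torch
import torch.nn as nn

class FeatureExtraction(nn.Module):
    # A 3x3 conv + ReLU yields the 9-channel input feature map 520; a second
    # 3x3 conv + ReLU yields the 32-channel image feature map 506.
    def __init__(self):
        super().__init__()
        self.shallow = nn.Sequential(nn.Conv2d(1, 9, 3, padding=1), nn.ReLU())
        self.deep = nn.Sequential(nn.Conv2d(9, 32, 3, padding=1), nn.ReLU())

    def forward(self, y):
        f520 = self.shallow(y)
        f506 = self.deep(f520)
        return f520, f506

class OutputConversion(nn.Module):
    # A 3x3 conv + ReLU reduces 32 channels to 9 (intermediate feature map 522);
    # the 9-channel maps are added element-wise and pixel-shuffled to 3H x 3W.
    def __init__(self, scale: int = 3):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(32, scale * scale, 3, padding=1), nn.ReLU())
        self.depth_to_space = nn.PixelShuffle(scale)

    def forward(self, f510, f520):
        f522 = self.reduce(f510)
        return self.depth_to_space(f522 + f520)   # output component 504A
```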

[0060] The plurality of recursive blocks 508 include a first number of successive recursive blocks 508A and 508B. Each recursive block 508 includes a second number of residual units 512 that are successively coupled to each other and in series. Each residual unit 512 optionally includes a CNN (e.g., having two 3x3 convolutional layers) and a rectified linear unit (ReLU) layer. The first number is less than a recursive block threshold, and the second number is less than a residual unit threshold. The recursive block threshold and residual unit threshold are associated with a computational capability of the electronic device. In some embodiments, the image processing model 515 includes the first and second numbers. The server 102 obtains information of the computational capability of the electronic device and determines the recursive block threshold and residual unit threshold based on the information of the computational capability of the electronic device. The server 102 further determines the first and second numbers for the image processing model 515 based on the recursive block threshold and residual unit threshold. The image processing model 515 is provided to the electronic device for ISR and VSR.

[0061] In some embodiments, referring to Figure 5, the plurality of recursive blocks 508 include two successive recursive blocks 508A and 508B, and each recursive block 508 includes 2 residual units 512. In an example, each residual unit 512 includes two 3x3 convolution layers (pad 1, stride 1, and channel 32), and the first convolution layer is followed by a ReLU layer. A first recursive block 508A receives the image feature map 506 and generates a first block output feature map 526. A second recursive block 508B receives the first block output feature map 526 and generates the output feature map 510. The first block output feature map 526 and the output feature map 510 correspond to mid-level and high-level features of the image component 502A of the input image 502. For each recursive block 508, a block input feature and a unit output feature of an output residual unit are combined to generate a block output feature at the output of the recursive block 508. For the first recursive block 508A, the image feature map 506 and a unit output feature of an output residual unit 512OA are combined to generate the first block output feature map 526. For the second recursive block 508B, the first block output feature map 526 and a unit output feature of an output residual unit 512OB are combined to generate the output feature map 510.
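The recursive block structure in this example can be sketched as follows; this is an illustrative assumption of one plausible arrangement (two residual units in series plus the block-level skip connection), not a verbatim reproduction of the application's network.

```python
import torch.nn as nn

class ResidualUnit(nn.Module):
    # Two 3x3 convolutions (pad 1, stride 1, 32 channels); the first is followed by a ReLU.
    def __init__(self, channels: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return self.body(x)

class RecursiveBlock(nn.Module):
    # Residual units coupled in series; the skip connection 514 adds the block input
    # to the unit output feature of the last (output) residual unit.
    def __init__(self, num_units: int = 2, channels: int = 32):
        super().__init__()
        self.units = nn.Sequential(*[ResidualUnit(channels) for _ in range(num_units)])

    def forward(self, x):
        return x + self.units(x)
```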

[0062] In some embodiments, the image processing model 515 applied in the image processing process 500 is trained using a predefined loss function L. The predefined loss function L is a weighted combination of a pixel loss Lpix, a structural similarity loss LSSIM, and a perceptual loss LVGG as follows:

L = λ1·Lpix + λ2·LSSIM + λ3·LVGG, (1)

where λ1, λ2, and λ3 are weights for combining the losses Lpix, LSSIM, and LVGG. The pixel loss Lpix indicates a pixel-wise difference between a test output image and a ground truth image. The pixel-wise difference is optionally measured as an L1 loss (i.e., a mean absolute error) or an L2 loss (i.e., a mean square error). The L1 loss shows improved performance and convergence over the L2 loss. The pixel loss Lpix is highly correlated with the pixel-wise difference, and minimizing the pixel loss directly maximizes a peak signal-to-noise ratio (PSNR). The structural similarity loss LSSIM indicates a structural similarity between the test output image and ground truth images based on comparisons of luminance, contrast, and structures. That is, the structural similarity loss LSSIM evaluates a reconstruction quality from the perspective of the human visual system. The perceptual loss LVGG indicates a semantic difference between the test output image and ground truth images using a pre-trained Visual Geometry Group (VGG) image classification network, thereby reflecting how high frequency content is restored for perceptual satisfaction.
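A hedged sketch of equation (1) in code form is shown below. The SSIM and VGG terms are passed in as callables because the application does not prescribe a particular implementation, and the weight values shown are placeholders.

```python
import torch.nn.functional as F

def combined_loss(output, target, ssim_fn, vgg_features,
                  w_pix=1.0, w_ssim=0.1, w_vgg=0.1):
    # Pixel loss: L1 (mean absolute error); F.mse_loss would give the L2 variant.
    l_pix = F.l1_loss(output, target)
    # Structural similarity loss: 1 - SSIM, so higher similarity lowers the loss.
    l_ssim = 1.0 - ssim_fn(output, target)
    # Perceptual loss: distance between pre-trained VGG feature maps.
    l_vgg = F.l1_loss(vgg_features(output), vgg_features(target))
    return w_pix * l_pix + w_ssim * l_ssim + w_vgg * l_vgg
```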

[0063] Quantization is applied to perform computation and store weights and biases at lower bit widths than a floating point precision. A quantized model executes some or all of the operations on the weights and biases with integers rather than floating point values. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms. In some embodiments, the image processing model 515 is quantized according to a precision setting of the electronic device where the image processing model 515 will be loaded. For example, the electronic device is a mobile device that has limited computational resources and a lower precision than a floating point data format. Weights and biases of the image processing model 515 are quantized based on the lower precision. The quantized image processing model 515 results in a significant accuracy drop, and makes image processing a lossy process. In some embodiments, the image processing model 515 is re-trained with the quantized weights and biases to minimize a loss function L.

[0064] In some embodiments, weights and biases associated with filters of the image processing model 515 maintain a float32 format, and are quantized based on a precision setting of the electronic device. For example, the weights and biases are quantized from the float32 format to an int8, uint8, int16, or uint16 format based on the precision setting of the electronic device. Specifically, in an example, the electronic device uses a CPU to run the image processing model 515, and the CPU of the electronic device processes 32-bit data. The weights and biases of the image processing model 515 are not quantized, and the image processing model 515 is provided to the electronic device directly. In another example, the electronic device uses one or more GPUs to run the image processing model 515, and the GPU(s) process 16-bit data. The weights and biases of the image processing model 515 are quantized to an int16 format. In yet another example, the electronic device uses a DSP to run the image processing model 515, and the DSP processes 8-bit data. The weights and biases of the image processing model 515 are quantized to an int8 format. After quantization of the weights and biases, e.g., to a fixed 8-bit format, the image processing model 515 has fewer multiply-accumulate (MAC) operations and a smaller size, and is hardware-friendly during deployment on the electronic device.
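The processor-dependent precision selection can be sketched as a simple lookup followed by fake quantization of the weights. The helper below is an assumption for illustration only; a real deployment would typically rely on a framework's quantization-aware-training tooling rather than this hand-rolled rounding.

```python
import torch

# Bit widths per processor type, following the CPU / GPU / DSP examples above.
PRECISION_BY_PROCESSOR = {"cpu": None, "gpu": 16, "dsp": 8}

def quantize_weights(model: torch.nn.Module, processor: str) -> torch.nn.Module:
    bits = PRECISION_BY_PROCESSOR[processor]
    if bits is None:
        return model  # 32-bit CPU: deploy the float32 model unchanged
    qmax = 2 ** (bits - 1) - 1
    with torch.no_grad():
        for p in model.parameters():
            scale = p.abs().max().clamp(min=1e-8) / qmax
            # Fake quantization: snap to the integer grid, then de-quantize,
            # mimicking the accuracy of an integer deployment.
            p.copy_((p / scale).round().clamp(-qmax - 1, qmax) * scale)
    return model
```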

[0065] In an example, weights and biases of an image processing model 515 have a float32 format and are quantized to a uint8 format. Compared with the image processing model 515 having the float32 format, the quantized image processing model 515 only causes a loss of 0.2 dB on image information that is contained in the output image 504 created by super-resolution. Moreover, the quantized image processing model 515 is executed within a duration of 20 milliseconds by a neural processing unit (NPU), and can be applied to process image frames of a video stream at a frame rate of 50 FPS.

[0066] The image processing model 515 applied in the image processing process 500 is limited by capabilities of the electronic device (e.g., a size of a random-access memory (RAM), computation resources, power consumption requirements, FLOPS of a system on chip (SoC) of a mobile phone). Architecture of the image processing model 515 is designed according to the capabilities of the electronic device. In the present application, the image processing (i.e., VSR) method 500 is designed based on hardware friendly operations, e.g., using 8-bit quantization aware training (QAT) in a YUV domain. In some embodiments, VSR is applied to one or more color components in an RGB domain. The R, G, and B components correspond to red, green, and blue colors of a given pixel. Alternatively, in some embodiments, VSR is applied to one or more color components in a YUV domain. A YUV color model defines a color space in terms of one luma component (Y) and two chrominance components including U (blue projection) and V (red projection). YUV encodes a color image or video taking human perception into account, allowing reduced bandwidth for chrominance components. A plurality of video devices, therefore, render directly using YUV or luminance/chrominance images. The most important component for YUV capture is the luminance or Y component. The Y component has a sampling rate greater than a distinct sampling rate of the U or V component. In some situations, VSR is applied only on the Y channel in the image processing process 500. Such a VSR process operates with one third of the FLOPs applied to process the RGB color format. Extensive experiments show that VSR in the YUV domain achieves a greater super-resolution PSNR score, conserves mobile computing resources, and enhances a deployment efficiency of the image processing model 515.
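The Y-channel-only variant can be sketched as below, assuming full-resolution (4:4:4) luminance and chrominance planes; the helper name and the bilinear chroma upscaling are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def super_resolve_yuv(y, u, v, sr_model, scale: int = 3):
    # Only the luminance plane passes through the network, roughly one third of the
    # FLOPs needed to super-resolve all three planes of an RGB image.
    y_hr = sr_model(y)                                   # (N, 1, scale*H, scale*W)
    u_hr = F.interpolate(u, scale_factor=scale, mode="bilinear", align_corners=False)
    v_hr = F.interpolate(v, scale_factor=scale, mode="bilinear", align_corners=False)
    return torch.cat([y_hr, u_hr, v_hr], dim=1)          # YUV output image
```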

[0067] Based on the image processing method 500, real-time VSR is enabled efficiently on the electronic device in terms of runtime, model parameters, FLOPs, and power consumption. The image processing method 500 is executed on many mobile devices with high performance, e.g., at a rate of 30 FPS, and particularly, outperforms state-of-the-art methods on most of the public datasets in terms of signal quality (e.g., measured in PSNR). The image processing model 515 applied in the image processing process 500 is robust to uint8 quantization and corresponds to only a 0.2 dB PSNR drop when compared with a float32 model built on the DIV2K validation dataset. Moreover, VSR is implemented in the YUV domain, improving the signal quality, structural similarity, and visual perception of the input image 502 as well as the model inference abilities of the image processing model 515.

[0068] Figures 6A and 6B are two flow diagrams of example image processing methods 600 and 650 for increasing an image resolution (i.e., image or video super-resolution), in accordance with some embodiments. In optical photography, each pixel is a sample of an original image, and more samples provide a more detailed representation. The number of pixels in an input image is sometimes called a resolution. While a long-focus lens can be applied to provide a high-resolution input image, a range of a scene captured by the lens is usually limited by a size of a sensor array at an image plane. In some situations, a wide-range scene is captured at a lower resolution with a short-focus camera (e.g., a wide-angle lens), and the image processing method 600 or 650 is applied to recover high-resolution raw data from the low-resolution version. As such, the image processing methods 600 and 650 are implemented on mobile devices based on real-time raw super-resolution models (e.g., models 615 and 615’), and such raw super-resolution models are established based on raw data degradation pipelines to recover high-resolution raw data with a high image fidelity.

[0069] Each of the image processing methods 600 and 650 is implemented by an electronic device (e.g., a mobile phone 104D). The electronic device obtains raw image data captured by image sensors of a camera. The raw image data includes an input image 602 having a first resolution (e.g., H×W). The electronic device generates an output image 604 having a third resolution (e.g., 3H×3W in Figure 6A, 2H×2W in Figure 6B) that is greater than the first resolution. The input image 602 is optionally a static image or an image frame of a video clip. In some situations, the input image 602 is captured by a camera of the electronic device. Alternatively, in some situations, the input image 602 is received via one or more communication networks 108 and in a user application, e.g., a social networking application, a social media application, a short video application, and a media play application. Examples of the user application include, but are not limited to, Tiktok, Kuaishou, WeChat, Tencent video, iQiyi, and Youku. Given a limited signal transmission bandwidth, a server 102 associated with the user application streams low-resolution visual data including the input image 602 to electronic devices distributed at different client nodes. If displayed without ISR or VSR, the input image 602 would result in a poor user experience for users of the user application. In an example, the input image 602 is part of a low-resolution video stream provided to unpaid users in a media play application. VSR aims to improve video quality and the users’ watching experience by utilizing artificial intelligence. Each of the image processing methods 600 and 650 uses low-resolution information of the input image 602 and associated adjacent temporal information to predict missing information of the input image 602, which leads to a high-resolution video sequence including the output image 604. In some embodiments, each of the image processing methods 600 and 650 enhances a quality of the input image 602, e.g., by reducing noise, blurriness, and artifacts therein.

[0070] The electronic device extracts an image feature map 606 from the input image 602. The image feature map 606 has a second resolution (e.g., H/m×W/m, where m is a positive integer) that is equal to or less than the first resolution. In an example, the image feature map 606 is expanded to a plurality of channels, e.g., 9 or 32 channels. The image feature map 606 is further processed by a sequence of successive recursive blocks 608 to generate an output feature map 610. Each recursive block 608 includes a plurality of residual units 612 and a skip connection 614 coupling an input of the recursive block 608 to an output of the recursive block 608. The electronic device converts the output feature map 610 to the output image 604 having the third resolution (e.g., 3H×3W) that is greater than the first resolution. A color image 640 is further generated from the output image 604. The color image 640 has a color mode that is one of: PMS, RGB, CMYK, HEX, YUV, YCbCr, LAB, Index, Greyscale, and Bitmap.

[0071] The sequence of successive recursive blocks 608 includes one or more recursive blocks 608. Referring to Figure 6A, the sequence of recursive blocks 608 includes two recursive blocks 608A and 608B coupled to each other and in series. Each recursive block 608 further includes two residual units 612 coupled to each other and in series. Feature maps processed in the successive recursive blocks 608A and 608B have the second resolution of H×W and 32 channels. Referring to Figure 6B, the sequence of recursive blocks 608 includes a single recursive block 608C, and the recursive block has four or more residual units 612 that are coupled to each other and in series. Feature maps processed in the recursive block 608C have the second resolution of H/2×W/2 and 32 channels.

[0072] Each of the image processing methods 600 and 650 is implemented based on a respective image processing model 615 or 615’ that includes a feature extraction model 616 and an output conversion module 618 in addition to the sequence of recursive blocks 608. The feature extraction model 616 is configured to extract the image feature map 606 from the input image 602. In an example (Figure 6A), a 3x3 convolution layer is followed by one ReLU layer to extract shallow features represented in a 9-channel input feature map 620. Another 3x3 convolution layer followed by one ReLU layer is optionally applied to extract additional features represented in the image feature map 606, e.g., having 32 channels. In another example (Figure 6B), a 3x3 convolution layer is followed by one ReLU layer to extract the image feature map 606 having the second resolution (H/2×W/2) and 32 channels. Referring to Figures 6A and 6B, the output conversion module 618 is coupled to an output of the sequence of recursive blocks 608 and configured to convert the output feature map 610 to the output image 604. In the output conversion module 618, a 3x3 convolution layer is followed by one ReLU layer to convert the output feature map 610 having 32 channels to a 9-channel intermediate feature map 622A in Figure 6A or a 16-channel intermediate feature map 622B in Figure 6B. Referring to Figure 6A, the input feature map 620 and the intermediate feature map 622A are combined on an element-by-element basis and processed by a depth space model 624 (also called a pixel shuffle layer) to generate the output image 604 having the third resolution (e.g., 3H×3W). Alternatively, referring to Figure 6B, the intermediate feature map 622B is processed by a depth space model 624 (also called a pixel shuffle layer) to generate the output image 604 having the third resolution (e.g., 2H×2W).
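The two output-conversion variants can be sketched with the layers below. The 3x path mirrors the Figure 6A description (9 channels, pixel shuffle by 3). For the Figure 6B path, one consistent reading of the stated channel counts and resolutions is a pixel shuffle by 4 applied to the 16-channel, half-resolution map, which lands at 2H x 2W; this factor is an assumption inferred from those numbers rather than an explicit statement in the text.

```python
import torch.nn as nn

# Figure 6A path: feature maps at H x W, 32 -> 9 channels, pixel shuffle by 3 -> 3H x 3W.
to_9ch = nn.Sequential(nn.Conv2d(32, 9, 3, padding=1), nn.ReLU())
shuffle_x3 = nn.PixelShuffle(3)

# Figure 6B path: feature maps at H/2 x W/2, 32 -> 16 channels, pixel shuffle by 4 -> 2H x 2W.
to_16ch = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
shuffle_x4 = nn.PixelShuffle(4)
```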

[0073] Referring to Figure 6A, the sequence of recursive blocks 608 includes a first number of successive recursive blocks 608A and 608B. Each recursive block 608 includes a second number of residual units 612 that are successively coupled to each other and in series. Each residual unit 612 optionally includes a CNN (e.g., having two 3x3 convolutional layers) and a rectified linear unit (ReLU) layer. The first number is less than a recursive block threshold (e.g., 3), and the second number is less than a residual unit threshold (e.g., 6). The recursive block threshold and residual unit threshold are associated with a computational capability of the electronic device. In some embodiments, the image processing model 615 includes the first and second numbers. The server 102 obtains information of the computational capability of the electronic device and determines the recursive block threshold and residual unit threshold based on the information of the computational capability of the electronic device. The server 102 further determines the first and second numbers for the image processing model 615 based on the recursive block threshold and residual unit threshold. The image processing model 615 is provided to the electronic device.

[0074] In some embodiments, referring to Figure 6A, the sequence of recursive blocks 608 includes two successive recursive blocks 608A and 608B, and each recursive block 608 includes 2 residual units 612. In an example, each residual unit 612 includes two 3x3 convolution layers (pad 1, stride 1, and channel 32), and the first convolution layer is followed by a ReLU layer. A first recursive block 608A receives the image feature map 606 and generates a first block output feature map 626. A second recursive block 608B receives the first block output feature map 626 and generates the output feature map 610. The first block output feature map 626 and the output feature map 610 correspond to mid-level and high-level features of the input image 602. For each recursive block 608, a block input feature and a unit output feature of an output residual unit are combined to generate a block output feature at the output of the recursive block 608. For the first recursive block 608A, the image feature map 606 and a unit output feature of an output residual unit 612OA are combined to generate the first block output feature map 626. For the second recursive block 608B, the first block output feature map 626 and a unit output feature of an output residual unit 612OB are combined to generate the output feature map 610.

[0075] In some embodiments, the image processing models 615 and 615’ are trained using a predefined loss function L. The predefined loss function L is a weighted combination of a pixel loss Lpix, a structural similarity loss LSSIM, and a perceptual loss LVGG based on equation (1) as follows:

L = λ1·Lpix + λ2·LSSIM + λ3·LVGG, (1)

where λ1, λ2, and λ3 are weights for combining the losses Lpix, LSSIM, and LVGG. The pixel loss Lpix indicates a pixel-wise difference between a test output image and a ground truth image. The pixel-wise difference is optionally measured as an L1 loss (i.e., a mean absolute error) or an L2 loss (i.e., a mean square error). The L1 loss shows improved performance and convergence over the L2 loss. The pixel loss Lpix is highly correlated with the pixel-wise difference, and minimizing the pixel loss directly maximizes a PSNR. The structural similarity loss LSSIM indicates a structural similarity between the test output image and ground truth images based on comparisons of luminance, contrast, and structures. That is, the structural similarity loss LSSIM evaluates a reconstruction quality from the perspective of the human visual system. The perceptual loss LVGG indicates a semantic difference between the test output image and ground truth images using a pre-trained VGG image classification network, thereby reflecting how high frequency content is restored for perceptual satisfaction.

[0076] In some embodiments, quantization is applied to perform computations and store weights and biases at lower bit widths than a floating point precision. A quantized model applied in the method 600 or 650 executes some or all of the operations on the weights and biases with integers rather than floating point values. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms. In some embodiments, the image processing model 615 or 615’ is quantized according to a precision setting of the electronic device where the image processing model 615 or 615’ will be loaded. For example, the electronic device is a mobile device that has limited computational resources and a lower precision than a floating point data format. Weights and biases of the image processing model 615 or 615’ are quantized based on the lower precision. The quantized image processing model 615 or 615’ results in a significant accuracy drop, and makes image processing a lossy process. In some embodiments, the image processing model 615 or 615’ is re-trained with the quantized weights and biases to minimize a loss function L.

[0077] In some embodiments, weights and biases associated with filters of the image processing model 615 or 615’ maintain a float32 format, and are quantized based on a precision setting of the electronic device. For example, the weights and biases are quantized from the float32 format to an int8, uint8, int16, or uint16 format based on the precision setting of the electronic device. Specifically, in an example, the electronic device uses a CPU to run the image processing model 615 or 615’, and the CPU of the electronic device processes 32-bit data. The weights and biases of the image processing model 615 or 615’ are not quantized, and the image processing model 615 or 615’ is provided to the electronic device directly. In another example, the electronic device uses one or more GPUs to run the image processing model 615 or 615’, and the GPU(s) process 16-bit data. The weights and biases of the image processing model 615 or 615’ are quantized to an int16 format. In yet another example, the electronic device uses a DSP to run the image processing model 615 or 615’, and the DSP processes 8-bit data. The weights and biases of the image processing model 615 or 615’ are quantized to an int8 format. After quantization of the weights and biases, e.g., to a fixed 8-bit format, the image processing model 615 or 615’ has fewer MAC operations and a smaller size, and is hardware-friendly during deployment on the electronic device.

[0078] In an example, weights and biases of an image processing model 615 or 615’ applied in the method 600 or 650 have a float32 format and are quantized to a uint8 format. Compared with the image processing model 615 or 615’ having the float32 format, the quantized image processing model 615 or 615’ only causes a loss of 0.2 dB on image information that is contained in the output image 604 created by super-resolution. Moreover, the quantized image processing model 615 or 615’ is executed within a duration of 20 milliseconds by a neural processing unit (NPU), and can be applied to process image frames of a video stream at a frame rate of 50 FPS.

[0079] The image processing model 615 or 615’ is limited by capabilities of the electronic device (e.g., a size of a random-access memory (RAM), computation resources, power consumption requirements, FLOPS of a system on chip (SoC) of a mobile phone). Architecture of the image processing model 615 or 615’ is designed according to the capabilities of the electronic device. In the present application, the image processing (i.e., VSR) method 600 or 650 is designed based on hardware friendly operations, 8-bit quantization aware training (QAT), and raw image data. Such VSR achieves a greater super-resolution PSNR score, conserves mobile computing resources, and enhances a deployment efficiency of the image processing model 615 or 615’. Additionally, based on the image processing method 600 or 650, real-time VSR is enabled efficiently on the electronic device in terms of runtime, model parameters, FLOPs, and power consumption. The image processing method 600 or 650 is executed on many mobile devices with high performance, e.g., at a rate of 30 FPS, and particularly, outperforms state-of-the-art methods on most of the public datasets in terms of signal quality (e.g., measured in PSNR). The image processing model 615 or 615’ is robust to uint8 quantization.

[0080] In some embodiments, raw image data are directly applied to restore high-resolution clear images 604. More information can be exploited in a raw image domain because the raw image data are arranged in 10 or 12 bits. In contrast, RGB or YUV images produced by an image signal processor (ISP) of a camera are represented in 8 bits. The ISP introduces nonlinear degradations, such as tone mapping and Gamma correction. Linear degradations (e.g., blurriness and noise) become nonlinear in the RGB or YUV domain, making image restoration difficult. VSR in the raw image domain effectively avoids image restoration based on nonlinear degradations and generates the output image 604 with better image qualities compared with those restored in the RGB or YUV domain.

[0081] Figure 7 is a flow diagram of an image processing process 700 that increases an image resolution for ISR or VSR using an example SRRN 710, in accordance with some embodiments. The image processing process 700 is implemented by an electronic device (e.g., a data processing module 230 of a mobile phone 104C). The electronic device obtains an input image 702 having a first resolution (e.g., H×W) and generates an output image 704 having a second resolution (e.g., nH×nW, where n is a positive integer) that is greater than the first resolution. The input image 702 is optionally a static image or an image frame of a video clip. In some situations, the input image 702 is received via one or more communication networks 108 and in a user application 224, e.g., a social networking application, a social media application, a short video application, and a media play application. Examples of this user application 224 include, but are not limited to, Tiktok, Kuaishou, WeChat, Tencent video, iQiyi, and Youku. Given a limited signal transmission bandwidth, a server 102 associated with the user application 224 streams low-resolution visual data including the input image 702 to electronic devices distributed at different client nodes. If displayed without ISR, the input image 702 would result in a poor user experience for users of the user application 224. In an example, the input image 702 is part of a low-resolution video stream provided to unpaid users of a media play application. VSR aims to improve video quality and the users’ watching experience by utilizing artificial intelligence. As such, the image processing process 700 uses low-resolution information of the input image 702 to predict missing information of the input image 702, which leads to a high-resolution video sequence including the output image 704.

[0082] The image processing process 700 is implemented based on an image processing model 706 that includes a feature extraction model 708 and an output conversion model 712 in addition to an SRRN 710. The SRRN 710 includes a plurality of residual block groups 710A-710B. The feature extraction model 708 is configured to extract an image feature map 714 from the input image 702. In an example, the feature extraction model 708 includes a 3x3 convolution layer followed by one ReLU layer. The SRRN 710 converts the image feature map 714 to an output feature map 716. The output conversion model 712 is coupled to an output of the plurality of residual block groups of the SRRN 710, and configured to convert the output feature map 716 generated by the SRRN 710 to the output image 704. In the output conversion model 712, a 3x3 convolution layer is followed by one ReLU layer to convert the output feature map 716 to an intermediate feature map 718. The image feature map 714 and the intermediate feature map 718 are combined on an element-by-element basis and processed by a depth space model 524 (also called a pixel shuffle layer) to generate the output image 704 having the second resolution.

[0083] The SRRN 710 includes a first residual block group 710A and a second residual block group 710B coupled (e.g., directly or indirectly) to the first residual block group 710A. Each of the first and second residual block groups 710A and 710B is made from a plurality of residual blocks (e.g., 4 residual blocks RB1, RB2, RB3, and RB4, which have different weight values). The plurality of residual blocks RB1-RB4 are coupled in series according to a first order and a second order to form the first residual block group 710A and the second residual block group 710B, respectively. The second order is distinct from the first order. For example, the residual blocks RB1-RB4 are ordered into a sequence of residual blocks RB1, RB2, RB3, and RB4 in the first residual block group 710A and into another sequence of residual blocks RB2, RB3, RB4, and RB1 in the second residual block group 710B.

[0084] In some embodiments, the plurality of residual blocks RB1-RB4 of the first residual block group 710A having the first order are shifted circularly by one residual block to form the second residual block group 710B having the second order. For example, referring to Figure 7, the residual blocks RB2, RB3, and RB4 are shifted left (i.e., clockwise) by one residual block, and the residual block RB1 is moved to an end of the first residual block group 710A to form the second residual block group 710B. Further, in some embodiments not shown, the residual blocks RB3 and RB4 are shifted left (i.e., clockwise) by two residual blocks, and the residual blocks RB1 and RB2 are moved to an end of the first residual block group 710A to form another residual block group 710 having an ordered sequence of residual blocks RB3, RB4, RB1, and RB2. In an example not shown, the residual blocks RB1, RB2, and RB3 are shifted right by one residual block, and the residual block RB4 is moved to a start of the first residual block group 710A to form a distinct residual block group having an ordered sequence of residual blocks RB4, RB1, RB2, and RB3. As such, in some embodiments, the plurality of residual blocks RB are shifted circularly in a clockwise or counter-clockwise direction by one or more residual blocks to change between two distinct residual block groups 710.

[0085] In some embodiments, each of the plurality of residual blocks RB1-RB4 includes an input interface 722, a first convolutional layer 724, a ReLU 726, a second convolutional layer 728, an output interface 730, and a skip connection 732, and the skip connection 732 couples the input interface 722 to the output interface 730. An input feature is received via the input interface 722 and combined (e.g., by an element-wise sum) with an output feature of the second convolutional layer 728 to generate an output feature of the respective residual block at the output interface 730. In some embodiments, the plurality of residual blocks RB1-RB4 have the same network structure, but have different weight values for the same network structure (i.e., at least one weight of the same network structure has different values for any two of the plurality of residual blocks RB1-RB4).
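A minimal sketch of one such residual block, with the convolution-ReLU-convolution body and the skip connection from the input interface to the output interface, is shown below; the 3x3 kernel and 32-channel sizes follow the convention used elsewhere in this description and are otherwise an assumption.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    # First convolutional layer 724, ReLU 726, second convolutional layer 728;
    # the skip connection 732 adds the input feature to the second layer's output.
    def __init__(self, channels: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(self.relu(self.conv1(x)))
```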

[0086] In some embodiments, quantization is applied to perform computation and store weights and biases of the image processing model 706 at lower bit widths than a floating point precision. A quantized model executes some or all of the operations on the weights and biases with integers rather than floating point values. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms. In some embodiments, the image processing model 706 is quantized according to a precision setting of the electronic device where the image processing model 706 will be loaded. For example, the electronic device is a mobile device that has limited computational resources and a lower precision than a floating point data format. Weights and biases of the image processing model 706 are quantized based on the lower precision. The quantized image processing model 706 results in a significant accuracy drop, and makes image processing a lossy process. In some embodiments, the image processing model 706 is re-trained with the quantized weights and biases to minimize a loss function L.

[0087] In some embodiments, weights and biases associated with filters of the image processing model 706 maintain a float32 format, and are quantized based on a precision setting of the electronic device. For example, the weights and biases are quantized from the float32 format to an int8, uint8, int16, or uint16 format based on the precision setting of the electronic device. Specifically, in an example, the electronic device uses a CPU to run the image processing model 706, and the CPU of the electronic device processes 32-bit data. The weights and biases of the image processing model 706 are not quantized, and the image processing model 706 is provided to the electronic device directly. In another example, the electronic device uses one or more GPUs to run the image processing model 706, and the GPU(s) process 16-bit data. The weights and biases of the image processing model 706 are quantized to an int16 format. In yet another example, the electronic device uses a digital signal processor (DSP) to run the image processing model 706, and the DSP processes 8-bit data. The weights and biases of the image processing model 706 are quantized to an int8 format. After quantization of the weights and biases, e.g., to a fixed 8-bit format, the image processing model 706 has fewer MAC operations and a smaller size, and is hardware-friendly during deployment on the electronic device.

[0088] In an example, weights and biases of an image processing model 706 have a float32 format and are quantized to a uint8 format. Compared with the image processing model 706 having the float32 format, the quantized image processing model 706 only causes a loss of 0.2 dB on image information that is contained in the output image 704 created by super-resolution. Moreover, the quantized image processing model 706 is executed within a duration of 20 milliseconds by a neural processing unit (NPU), and can be applied to process image frames of a video stream at a frame rate of 50 FPS.

[0089] The image processing model 706 applied in the image processing process 700 is limited by capabilities of the electronic device (e.g., a size of a random-access memory (RAM), computation resources, power consumption requirements, FLOPS of a system on chip (SoC) of a mobile phone). Architecture of the image processing model 706 is designed according to the capabilities of the electronic device. In the present application, the image processing (i.e., VSR) process 700 is designed based on hardware friendly operations, e.g., using 8-bit quantization aware training (QAT) in a YUV domain. As such, the image processing process 700 is applicable in different image domains based on hardware capabilities.

[0090] In some embodiments, the input image 702 includes a raw image captured by an image sensor array. After converting the output feature map 716 to the output image 704, an ISP performs image processing operations on the output image 704 to generate an RGB color image. The image processing operations include one or more of demosaicing, denoising, and auto functions. More information can be exploited in a raw domain, because the raw images are typically recorded in a 10-bit or 12-bit format, whereas a color image (e.g., an RGB or YUV image produced by an ISP) is typically stored in an 8-bit format. Also, the raw images have not been processed by an image signal processor (ISP) that applies nonlinear processing (e.g., tone mapping, Gamma correction, blurring, noise filtering) in an RGB or YUV space, and therefore, are not impacted by the difficulties that the nonlinear processing causes for image or video restoration in the RGB/YUV domain. ISR or VSR of the raw images is configured to perform real-time inference efficiently on mobile devices.

[0091] Alternatively, in some embodiments, the image processing process 700 is applied in an RGB domain having R, G, and B components, which correspond to red, green, and blue colors of a given pixel. The electronic device obtains a raw image captured by an image sensor array, and an ISP performs image processing operations on the raw image to generate the input image 702 for ISR or VSR using the image processing process 700. Alternatively and additionally, in some embodiments, the image processing process 700 is applied in a YUV domain. A YUV color model defines a color space in terms of one luma component (Y) and two chrominance components including U (blue projection) and V (red projection). YUV encodes a color image or video taking human perception into account, allowing reduced bandwidth for chrominance components. A plurality of video devices, therefore, render directly using YUV or luminance/chrominance images.

[0092] Based on the image processing process 700, real-time VSR is enabled efficiently on the electronic device in terms of runtime, model parameters, FLOPs, and power consumption. The image processing process 700 is executed on many mobile devices with high performance, e.g., at a rate of 30 FPS, and particularly, outperforms state-of-the-art methods on most of the public datasets in terms of signal quality (e.g., measured in PSNR). The image processing model 706 applied in the image processing process 700 is robust to uint8 quantization and corresponds to only a 0.2 dB PSNR drop when compared with a float32 model built on the DIV2K validation dataset. Moreover, VSR is implemented in the YUV domain, improving the signal quality, structural similarity, and visual perception of the input image 702 as well as the model inference abilities of the image processing model 706.

[0093] In some embodiments, the electronic device sets the first and second orders for the first and second residual block groups 710A and 710B of the SRRN 710, respectively, and trains the SRRN 710 to determine weights of each of the plurality of residual blocks RB1-RB4. In some embodiments, recursive supervision learning and random shuffling are applied.

[0094] Figures 8A-8C are block diagrams of three example SRRNs 710 having three distinct shuffling schemes 800, 820, and 840 of residual blocks RB1-RB4, in accordance with some embodiments. In each of the distinct shuffling schemes 800, 820, and 840, the respective SRRN 710 includes a respective sequence of successive residual block groups made from a plurality of residual blocks RB1-RB4. In each residual block group 710 of the sequences of successive residual block groups in the SRRNs 710, the plurality of residual blocks are coupled in series according to a distinct order. For example, for the shuffling scheme 800, the respective SRRN 710 includes a sequence of successive residual block groups 710A, 710B, 710C, and 710D. For the shuffling scheme 820, the respective SRRN 710 includes a sequence of successive residual block groups 710A, 710D, 710C, and 710B. For the shuffling scheme 840, the respective SRRN 710 includes a sequence of successive residual block groups 710A, 710E, 710F, and 710G.

[0095] Each of the residual block groups 710A-710G includes the same residual blocks RB1, RB2, RB3, and RB4 arranged in a respective distinct order. In some embodiments, referring to Figure 8A, the plurality of residual blocks RB1-RB4 of a first residual block group 710A are shifted clockwise by one residual block to form a second residual block group 710B, and a first residual block RB1 is moved to an end of the second residual block group 710B. The residual blocks of the second residual block group 710B are shifted clockwise by one residual block to form a third residual block group 710C, and a second residual block RB2 is moved to an end of the third residual block group 710C. The residual blocks of the third residual block group 710C are shifted clockwise by one residual block to form the fourth residual block group 710D, and a third residual block RB3 is moved to an end of the fourth residual block group 710D. Alternatively, in some embodiments, referring to Figure 8B, the plurality of residual blocks RB1-RB4 of the first residual block group 710A are shifted counter-clockwise by one residual block to form the fourth residual block group 710D, and a fourth residual block RB4 is moved to a start of the fourth residual block group 710D. The residual blocks of the fourth residual block group 710D are shifted counter-clockwise by one residual block to form the third residual block group 710C, and the third residual block RB3 is moved to a start of the third residual block group 710C. The residual blocks of the third residual block group 710C are shifted counter-clockwise by one residual block to form the second residual block group 710B, and the second residual block RB2 is moved to a start of the second residual block group 710B.

[0096] In some embodiments, referring to Figure 8C, the plurality of residual blocks RB1-RB4 are randomly shuffled to form a sequence of successive residual block groups 710A, 710E, 710F, and 710G. The first residual block group 710A includes a sequence of residual blocks RB1-RB4, and is immediately followed by a fifth residual block group 710E including a sequence of residual blocks RB2, RB4, RB1, and RB3. The fifth residual block group 710E is immediately followed by a sixth residual block group 710F including a sequence of residual blocks RB4, RB2, RB3, and RB1. The sixth residual block group 710F is immediately followed by a seventh residual block group 710G including a sequence of residual blocks RB3, RB2, RB1, and RB4. As such, referring to Figures 8A-8C, in these embodiments, each SRRN 710 consists of 16 residual blocks having the same network structure (e.g., two convolutional layers 724 and 728 coupled by a ReLU 726 and a skip connection 732 in Figure 7), while having only four different sets of weight values corresponding to the residual blocks RB1, RB2, RB3, and RB4.
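The weight sharing implied by this arrangement, sixteen block positions but only four sets of weights, can be sketched as follows for the cyclic scheme 800. The SRRN class layout and the omission of any group-level skip connections are simplifying assumptions of this sketch.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Same structure as sketched above: conv 724 -> ReLU 726 -> conv 728, plus skip 732.
    def __init__(self, channels: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

def cyclic_orders(n: int = 4):
    # Shuffling scheme 800: each group is the previous group shifted circularly by one block.
    base = list(range(n))
    return [base[i:] + base[:i] for i in range(n)]

class SRRN(nn.Module):
    # Sixteen block positions, but only four distinct sets of weights (RB1-RB4):
    # the same four module instances are reused in every residual block group.
    def __init__(self, channels: int = 32):
        super().__init__()
        self.blocks = nn.ModuleList(ResidualBlock(channels) for _ in range(4))
        self.orders = cyclic_orders(4)  # [[0,1,2,3], [1,2,3,0], [2,3,0,1], [3,0,1,2]]

    def forward(self, x):
        out = x
        for order in self.orders:       # four residual block groups coupled in series
            for idx in order:
                out = self.blocks[idx](out)
        return out
```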

[0097] ISR and VSR are computer vision techniques widely applied in applications of mobile devices. There are several major issues that prevent the straightforward deployment of neural networks on mobile devices. For example, a mobile device has a restricted amount of RAM that cannot support many common deep learning layers, and the limited FLOPs of a mobile SoC constrain neural network performance and power consumption of mobile devices. In some embodiments, an image processing network includes a sequence of 16 residual blocks, any two of which have distinct weight values from each other. As the network depth increases, overfitting becomes highly likely, more data are required to train such an image processing network, and the network becomes too large to be stored and retrieved for mobile devices. In some embodiments, an image processing network includes a sequence of four identical residual block groups having four distinct residual blocks. The number of weights in this image processing network is controlled while the network depth increases; however, the same residual block group is used four times, which reduces the model generalization ability.

[0098] Conversely, compared with the 16 distinct residual blocks or the four identical residual block groups, the SRRN 710 requires a reasonable number of weight values for efficient ISR and VSR, controls the number of model parameters while increasing the network depth based on recursive learning, and increases the network generalization ability using a residual block shuffling technique. The image processing process 700 is implemented with only a portion (e.g., ½) of the FLOPs applied to the image processing network using 16 distinct residual blocks, and achieves a comparable (e.g., only slightly lower) super-resolution PSNR score. In some embodiments, the image processing process 700 is implemented with FLOPs similar to those applied to the image processing network using the sequence of four identical residual block groups having four distinct residual blocks, and achieves a greater super-resolution PSNR score. In some embodiments, the PSNR corresponding to the SRRN 710 is further enhanced by using residual blocks RB1-RB4 having larger sizes and/or a more effective loss function.

[0099] In some embodiments, each residual block group 710 includes all of the plurality of residual blocks used to form the sequence of successive residual block groups. Alternatively, in some embodiments, each residual block group 710 includes a respective subset (e.g., fewer than all) of the plurality of residual blocks used to form the sequence of successive residual block groups. For example, the plurality of residual blocks includes five residual blocks RB1-RB5, and each residual block group 710 includes four of the plurality of residual blocks RB1-RB5. The plurality of residual blocks RB1-RB5 of the first residual block group 710A having the first order are shifted circularly by one residual block to form the second residual block group 710B having the second order, while each residual block group 710 includes only the first four residual blocks of the shifted sequence. In an example, the first residual block group 710A includes an ordered sequence of residual blocks RB1, RB2, RB3, and RB4, and the second residual block group 710B includes an ordered sequence of residual blocks RB5, RB1, RB2, and RB3. Additionally, a third residual block group 710C includes an ordered sequence of residual blocks RB4, RB5, RB1, and RB2.

[00100] Figure 9 is a block diagram of an example neural network 900 for determining distinct orders of residual blocks in residual block groups of an SRRN 710, in accordance with some embodiments. The SRRN 710 includes a sequence of successive residual block groups (e.g., 710A-710D in Figure 8A) made from a plurality of residual blocks (e.g., RB1-RB4 in Figure 8A). In each residual block group in the SRRN 710, the plurality of residual blocks are coupled in series according to a distinct order of the plurality of residual blocks. The plurality of residual block groups includes a first residual block group 710A and a second residual block group 710B. The distinct orders of residual blocks in the first and second residual block groups 710A and 710B are selected based on the neural network 900.

[00101] Specifically, in some embodiments, the plurality of residual blocks includes four residual blocks RB1, RB2, RB3, and RB4. These four residual blocks RB1-RB4 are ordered into 24 distinct sequences of residual blocks in 24 residual block groups 902. The 24 distinct sequences of residual blocks of the 24 residual block groups 902 are arranged in parallel to receive an image feature map 904, which is generated from a test image 906 using a feature extraction model 708. Each residual block group 902A-902N generates a respective feature map 908A-908N from the image feature map 904. The respective feature maps 908A-908N are compared to each other to determine their similarity levels. In accordance with a determination that the first and second residual block groups 710A and 710B have the smallest similarity level among any two of the plurality of residual block groups, the first and second residual block groups 710A and 710B are selected for the SRRN 710. Additionally, the four residual block groups having the smallest similarity levels are selected to form the SRRN 710. In some embodiments, the selected residual block groups are organized according to different orders in different candidate SRRNs 710 and trained to obtain respective losses. An SRRN 710 having the smallest loss is applied to generate the output image 704 from the input image 702.
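A hedged sketch of this selection procedure is shown below: enumerate the 24 orderings of the four residual blocks, run each candidate group on the image feature map of a test image, and keep the orderings whose feature maps are least similar. Cosine similarity and the greedy total-similarity ranking are assumptions of this sketch; the application does not name a specific similarity measure.

```python
import itertools
import torch
import torch.nn.functional as F

def select_least_similar_groups(blocks, feature_map, num_groups: int = 4):
    # blocks: the four distinct residual blocks RB1-RB4 (nn.Module instances);
    # feature_map: the image feature map 904 extracted from a test image 906.
    candidates = list(itertools.permutations(range(len(blocks))))  # 24 orderings
    outputs = []
    with torch.no_grad():
        for order in candidates:
            out = feature_map
            for idx in order:
                out = blocks[idx](out)
            outputs.append(out.flatten())   # candidate feature maps 908A-908N
    # Pairwise similarity between the candidate feature maps.
    n = len(candidates)
    sim = torch.zeros(n, n)
    for i, j in itertools.combinations(range(n), 2):
        sim[i, j] = sim[j, i] = F.cosine_similarity(outputs[i], outputs[j], dim=0)
    # Keep the orderings with the smallest total similarity to all other candidates.
    keep = torch.argsort(sim.sum(dim=1))[:num_groups]
    return [candidates[k] for k in keep.tolist()]
```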

[00102] In some embodiments not shown, the plurality of residual blocks includes five residual blocks RB1, RB2, RB3, RB4, and RB5. Each residual block group 902 includes only four residual blocks selected from the plurality of residual blocks. Four out of the five residual blocks RB1-RB5 are selected and ordered into 120 distinct sequences of residual blocks in 120 residual block groups 902. A subset of these 120 residual block groups 902 is selected to form the SRRN 710 based on their corresponding similarity levels. In an example, four residual block groups 902 having the smallest similarity levels are selected. In another example, six residual block groups 902 having the smallest similarity levels are selected to form the SRRN 710.

[00103] Figure 10 is a flow diagram of an image processing process 1000 that increases an image resolution for ISR or VSR using a residual block based network 1010, in accordance with some embodiments. The image processing process 1000 is implemented by an electronic device (e.g., a data processing module 230 of a mobile phone 104C). The electronic device obtains a current input image 1002C having a first resolution (e.g., H×W) and generates a current output image 1004C having a second resolution (e.g., nH×nW, where n is a positive integer) that is greater than the first resolution. The current input image 1002C is an image frame of a video clip. In some situations, the current input image 1002C is received via one or more communication networks 108 and in a user application 224, e.g., a social networking application, a social media application, a short video application, and a media play application. Examples of this user application 224 include, but are not limited to, Tiktok, Kuaishou, WeChat, Tencent video, iQiyi, and Youku. Given a limited signal transmission bandwidth, a server 102 associated with the user application 224 streams low-resolution visual data including the current input image 1002C to electronic devices distributed at different client nodes. If displayed without VSR, the current input image 1002C would result in a poor user experience for users of the user application 224. In an example, the current input image 1002C is part of a low-resolution video stream provided to unpaid users of a media play application. VSR aims to improve video quality and the users’ watching experience by utilizing artificial intelligence. As such, the image processing process 1000 uses low-resolution information of the current input image 1002C to predict missing information of the current input image 1002C, which leads to a high-resolution video sequence including the current output image 1004C.

[00104] The current input image 1002C follows a prior input image 1002P in a sequence of image frames having a first resolution. The electronic device obtains the prior input image 1002P and the current input image 1002C, and applies the residual block based network 1010 to generate a prior output feature 1006P corresponding to the prior input image 1002P. The current input image 1002C and the prior output feature 1006P are combined to generate a current input feature 1008C. The residual block based network 1010 is applied to generate a current output feature 1006C from the current input feature 1008C. The current output feature 1006C is converted to the current output image 1004C having the second resolution. The current output image 1004C has the same image content as the current input image 1002C, except that the current output image 1004C has a higher resolution than the current input image 1002C. In some situations, the prior input image 1002P immediately precedes the current input image 1002C. Alternatively, in some situations, the prior input image 1002P is separated from the current input image 1002C by one or more input images.

[00105] In some embodiments, the prior output feature 1006P is generated based on the prior input image 1002P using the residual block based network 1010. The current input image 1002C has the first resolution, and the prior output feature 1006P has a resolution different from the first resolution. In some embodiments, a resolution of the current input image 1002C is converted from the first resolution to the resolution of the prior output feature 1006P that is generated from the prior input image 1002P. Alternatively, in some embodiments, the resolution of the prior output feature 1006P is converted to the first resolution of the current input image 1002C. Additionally and alternatively, in some embodiments, the resolution of the prior output feature 1006P and the first resolution of the current input image 1002C are converted to an alternative resolution distinct from the first resolution and the resolution of the prior output feature 1006P. After such resolution matching, the current input image 1002C and the prior output feature 1006P are concatenated to generate a concatenated input image 1012. The current input feature 1008C is extracted from the concatenated input image 1012, e.g., using a feature extraction model 1014 including a 3x3 convolutional layer and a ReLU.

[00106] The residual block based network 1010 converts the current input feature 1008C to a network output feature 1016. The current input feature 1008C and the network output feature 1016 are combined on an element-by-element basis, and processed by an output conversion module 1018. For example, in the output conversion module 1018, a 3x3 convolution layer is followed by one ReLU layer to convert a combination of the current input feature 1008C and the network output feature 1016 to the current output feature 1006C. A depth space model 1020 (also called a pixel shuffle layer) is applied to generate the current output image 1004C having the second resolution based on the current output feature 1006C.

[00107] In other words, the current input image 1002C is processed to enhance the first resolution of the current input image 1002C based on one or more prior input images 1002P that precede the current input image 1002C in the sequence of image frames. In some embodiments, the prior input image 1002P includes a first input image, and the prior output feature 1006P includes a first prior output feature. The electronic device obtains a second input image (not shown), and the first input image follows the second input image. The residual block based network 1010 is applied to generate a second prior output feature 1006P’ based on the second input image. The current input image 1002C is combined with both the first and second prior output features 1006P and 1006P’ to generate the current input feature 1008C. In some situations, the second input image immediately precedes the first input image. Alternatively, in some situations, the second input image is separated from the first input image by one or more input images. Additionally, in some embodiments, the sequence of image frames includes a plurality of successive groups of pictures (GOPs). The same GOP includes both the current input image 1002C and the one or more prior input images 1002P that provide the output features 1006P and 1006P’ to generate the current input feature 1008C.
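One recurrent step of this process can be sketched as below, assuming a single-channel (e.g., luminance) low-resolution frame. The module layout, the choice of which feature is carried forward to the next frame, and the omission of resolution matching between the frame and the prior output feature are all simplifying assumptions of this sketch; for the first frame, a zero tensor could stand in for the missing prior feature.

```python
import torch
import torch.nn as nn

class RecurrentVSRStep(nn.Module):
    def __init__(self, channels: int = 32, scale: int = 3):
        super().__init__()
        # Feature extraction model 1014: 3x3 conv + ReLU over the concatenation 1012.
        self.extract = nn.Sequential(nn.Conv2d(1 + channels, channels, 3, padding=1), nn.ReLU())
        # Stand-in for the residual block based network 1010.
        self.network = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Output conversion module 1018 and depth space model 1020.
        self.convert = nn.Sequential(nn.Conv2d(channels, scale * scale, 3, padding=1), nn.ReLU())
        self.depth_to_space = nn.PixelShuffle(scale)

    def forward(self, current_lr, prior_feature):
        concat = torch.cat([current_lr, prior_feature], dim=1)   # concatenated input 1012
        feat = self.extract(concat)                              # current input feature 1008C
        net_out = self.network(feat) + feat                      # network output combined element-wise
        out_feat = self.convert(net_out)                         # current output feature 1006C
        hr_frame = self.depth_to_space(out_feat)                 # current output image 1004C
        return hr_frame, net_out                                 # net_out is recycled for the next frame
```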

[00108] In some embodiments, quantization is applied to perform computation and store weights and biases of the image processing model 1022 at lower bit widths than a floating point precision. A quantized model executes some or all of the operations on the weights and biases with integers rather than floating point values. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms. In some embodiments, an image processing model 1022 includes the feature extraction model 1014, residual block based network 1010, output conversion module 1018, and depth space model 1020. The image processing model 1022 is quantized according to a precision setting of the electronic device where the image processing model 1022 will be loaded. For example, the electronic device is a mobile device having limited computational resources and has a lower precision than a floating point data format. Weights and biases of the image processing model 1022 are quantized based on the lower precision. The quantized image processing model 1022 can result in a significant accuracy drop and make image processing a lossy process. In some embodiments, the image processing model 1022 is re-trained with the quantized weights and biases to minimize a loss function L.
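A minimal sketch of how such quantization-aware retraining might look with PyTorch's eager-mode quantization API is shown below; the optimizer, learning rate, and the "fbgemm" backend are assumptions, and the quantization/dequantization stubs around the model, which a real eager-mode deployment would typically need, are omitted for brevity.

```python
import torch

def quantization_aware_retraining(model, train_loader, loss_fn, num_epochs=1):
    """Sketch of 8-bit quantization-aware retraining of a super-resolution model."""
    model.train()
    # Fake-quantization observers simulate low-precision behavior in the forward pass,
    # while the backward pass stays in floating point.
    model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
    prepared = torch.quantization.prepare_qat(model)

    optimizer = torch.optim.Adam(prepared.parameters(), lr=1e-4)
    for _ in range(num_epochs):
        for low_res, high_res in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(prepared(low_res), high_res)  # loss function L
            loss.backward()
            optimizer.step()

    # Replace the fake-quantized modules with true int8 modules for deployment.
    return torch.quantization.convert(prepared.eval())
```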

[00109] In some embodiments, weights and biases associated with filters of the image processing model 1022 maintain a float32 format, and are quantized based on a precision setting of the electronic device. For example, the weights and biases are quantized from the float32 format to an int8, uint8, int16, or uint16 format based on the precision setting of the electronic device. Specifically, in an example, the electronic device uses a CPU to run the image processing model 1022, and the CPU of the electronic device processes 32 bit data. The weights and biases of the image processing model 1022 are not quantized, and the image processing model 1022 is provided to the electronic device directly. In another example, the electronic device uses one or more GPUs to run the image processing model 1022, and the GPU(s) process 16 bit data. The weights and biases of the image processing model 1022 are quantized to an int16 format. In yet another example, the electronic device uses a DSP to run the image processing model 1022, and the DSP processes 8 bit data. The weights and biases of the image processing model 1022 are quantized to an int8 format. After quantization of the weights and biases, e.g., to a fixed 8-bit format, the image processing model 1022 has fewer MAC operations and a smaller size, and is hardware-friendly during deployment on the electronic device.
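The mapping from processor precision to weight format described in this paragraph could be expressed as a simple lookup; the function name and string keys below are purely illustrative.

```python
def select_weight_format(processor):
    """Illustrative mapping from processor precision to a quantized weight format."""
    precision_map = {
        "cpu_32bit": "float32",  # no quantization; the model is provided to the device directly
        "gpu_16bit": "int16",    # weights and biases quantized to a 16-bit integer format
        "dsp_8bit": "int8",      # weights and biases quantized to an 8-bit integer format
    }
    return precision_map.get(processor, "float32")
```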

[00110] In an example, weights and biases of the image processing model 1022 have a float32 format and are quantized to an uint8 format. Compared with the image processing model 1022 having the float32 format, the quantized image processing model 1022 only causes a small loss on image information that is contained in the output image 1004C created by super-resolution. However, the quantized image processing model 1022 is executed within a duration of 20 milliseconds by a neural processing unit (NPU), and can be applied to process image frames of a video stream at a frame rate of 50 FPS.

[00111] The image processing model 1022 applied in the image processing process 1000 is limited by capabilities of the electronic device (e.g., a size of a random-access memory (RAM), computation resources, power consumption requirements, FLOPS of a system on chip (SoC) of a mobile phone). The architecture of the image processing model 1022 is designed according to the capabilities of the electronic device. In the present application, the image processing (i.e., VSR) process 1000 is designed based on hardware friendly operations, e.g., using 8-bit quantization aware training (QAT) in a YUV domain. As such, the image processing process 1000 is applicable in different image domains based on hardware capabilities.

[00112] In some embodiments, the input image 1002 includes a raw image captured by an image sensor array. After converting the output feature 1006 to the output image 1004, an ISP performs image processing operations on the output image 1004 to generate an RGB color image. The image processing operations include one or more of demosaicing, denoising, and auto functions. Alternatively, in some embodiments, the image processing process 1000 is applied in an RGB domain having R, G, and B components, which correspond to red, green, and blue colors of a given pixel. The electronic device obtains a raw image captured by an image sensor array, and an ISP performs image processing operations on the raw image to generate the input image 1002 for ISR or VSR using the image processing process 1000. Alternatively and additionally, in some embodiments, the image processing process 1000 is applied in a YUV domain. A YUV color model defines a color space in terms of one luma component (Y) and two chrominance components including U (blue projection) and V (red projection). YUV encodes a color image or video taking human perception into account, allowing reduced bandwidth for chrominance components. A plurality of video devices, therefore, render directly using YUV or luminance/chrominance images.
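For reference, one common RGB-to-YUV conversion (the BT.601 coefficients) is sketched below; the application does not specify a particular conversion matrix, so the coefficients are shown only as a representative example of a YUV color model with one luma and two chrominance components.

```python
import torch

def rgb_to_yuv(rgb):
    """Convert an (N, 3, H, W) RGB tensor in [0, 1] to YUV using BT.601 coefficients."""
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    y = 0.299 * r + 0.587 * g + 0.114 * b       # luma (Y)
    u = -0.14713 * r - 0.28886 * g + 0.436 * b  # blue projection (U)
    v = 0.615 * r - 0.51499 * g - 0.10001 * b   # red projection (V)
    return torch.cat([y, u, v], dim=1)
```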

[00113] Based on the image processing process 1000, real-time VSR is enabled efficiently on the electronic device in terms of runtime, model parameters, FLOPs, and power consumption. The image processing process 1000 is executed on many mobile devices with high performance, e.g., at a rate of 30 FPS or above, and particularly, outperforms state-of-the-art methods in most of the public datasets in terms of signal quality (e.g., measured in PSNR). The image processing model 1022 applied in the image processing process 1000 is robust to uint8 quantization and corresponds to a small PSNR drop when compared with a float32 model built on the DIV2K validation dataset. Moreover, VSR is implemented in the YUV domain, improving signal quality, structural similarity, and visual perception of the input image 1002 and model inference abilities of the image processing model 1022.

[00114] In some embodiments, the image processing model 1022 includes a real-time raw VSR model, which is configured based on different operations of a meta-node latency on NPU. The raw VSR model includes three parts: a feature extraction model 1014 for shallow feature extraction, residual block based network 1010, and an upscale module including an output conversion module 1018 and depth space model 1020. In an example, the feature extraction model 1014 includes a 3x3 convolution layer (pad 1, stride 2, and channel 16) followed by an ReLU layer, and is configured to extract and downsample shallow features. In the residual block based network 1010, a residual block includes four 3x3 convolution layers (pad 1, stride 1, and channel 16) followed by one ReLU layer, and is configured to extract the mid-level and high-level features. The upscale module includes a 3x3 convolution layer (pad 1, stride 1, and channel 16) followed by a pixel shuffle layer. This raw VSR model is configured to convert low-resolution raw images (e.g., prior input image 1002P and current input image 1002C) and super-resolve the raw images with a factor (e.g., equal to 2). Such real-time raw VSR can be implemented on a mobile device, and takes a shortened duration of time (e.g., 30ms) to convert an example raw image having a size of 3Mb and a resolution of 2000x1500 to another raw image having a size of 12Mb and a resolution of 4000x3000.

[00115] Figure 11A is a block diagram of a residual block 1100, in accordance with some embodiments. Figure 11B is a block diagram of an example residual block based network 1010 including a sequence of residual blocks, in accordance with some embodiments. Figure 11C is a block diagram of another example residual block based network 1010 including a sequence of identical residual block groups, in accordance with some embodiments. Figure 11D is a block diagram of another example residual block based network 1010 including an SRRN 710 (Figure 7), in accordance with some embodiments. The residual block based network 1010 includes a plurality of residual blocks 1100. In some embodiments, referring to Figure 11A, each residual block 1100 includes an input interface 1102, a first convolutional layer 1104, an ReLU 1106, a second convolution layer 1108, an output interface 1110, and a skip connection 1112, and the skip connection 1112 couples the input interface 1102 to the output interface 1110.
An input feature is received via the input interface 1102 and combined (e.g., by an element-wise sum) with an output feature of the second convolutional layer 1108 to generate an output feature of the respective residual block 1100 at the output interface 1110. In some embodiments, the plurality of residual blocks in the residual block based network 1010 have the same network structure, but have different weight values for the same network structure (i.e., at least one weight of the same network structure has different values for any two of the plurality of residual blocks).
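The residual block 1100 of Figure 11A maps naturally onto a small PyTorch module; a minimal sketch is given below, with the 16-channel width borrowed from the example configuration in paragraph [00114] as an assumption.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of residual block 1100: conv, ReLU, conv, plus a skip connection."""
    def __init__(self, channels=16):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # first convolutional layer
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # second convolution layer

    def forward(self, x):
        # The skip connection adds the block input to the second convolution's output
        # (an element-wise sum at the output interface).
        return x + self.conv2(self.relu(self.conv1(x)))
```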

[00116] Referring to Figure 11B, in some embodiments, the residual block based network 1010 includes an input interface 1010I, an output interface 1010O, and a plurality of distinct residual blocks RB1-RBN that are coupled in series between the input and output interfaces 1010I and 1010O, and the input interface 1010I is coupled to the output interface 1010O via a skip connection 1122. The plurality of residual blocks RB1-RBN include N residual blocks, where N is a positive integer that is greater than 1. In some embodiments, the plurality of residual blocks RB1-RBN have the same network structure of the residual block 1100 including two convolutional layers 1104 and 1108 coupled by an ReLU 1106 and a skip connection 1112 in Figure 11A, but have N different sets of weight values corresponding to the N residual blocks RB1-RBN. Alternatively, in some embodiments, a subset of the plurality of residual blocks RB1-RBN (e.g., two residual blocks) are identical, e.g., have the same network structure of the residual block 1100 and apply the same sets of weight values. Additionally, in some embodiments, any two of the subset of residual blocks that are identical to each other are not adjacent to each other in the residual block based network 1010, and are separated by at least one residual block.
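A corresponding sketch of the Figure 11B arrangement is shown below: N residual blocks in series wrapped by a global skip connection. The constructor takes the blocks as arguments, so distinct blocks (same structure, independent weights) can be supplied, e.g., by reusing the ResidualBlock module sketched above.

```python
import torch
import torch.nn as nn

class ResidualBlockNetwork(nn.Module):
    """Sketch of Figure 11B: distinct residual blocks in series plus a global skip connection."""
    def __init__(self, blocks):
        super().__init__()
        # The blocks share one structure but carry independently trained weights.
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):
        # The input interface is coupled to the output interface via the skip connection.
        return x + self.blocks(x)
```

For example, ResidualBlockNetwork([ResidualBlock(16) for _ in range(4)]) builds a network of four distinct blocks with independently initialized weights.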

[00117] Referring to Figure 11C, in some embodiments, the residual block based network 1010 includes an input interface 1010I, an output interface 1010O, and a plurality of identical residual block groups 1142 that are coupled in series between the input and output interfaces 1010I and 1010O, and the input interface 1010I is coupled to the output interface 1010O via a skip connection 1144. Each identical residual block group 1142 includes a plurality of distinct residual blocks RB1-RBN that are coupled in series according to a fixed order. For example, the residual block based network 1010 includes M residual block groups, where M is a positive integer greater than 1, and the plurality of distinct residual blocks RB1-RBN includes N residual blocks, where N is a positive integer greater than 1. Each of the plurality of distinct residual blocks RB1-RBN has the same network structure as the residual block 1100. For every two of the plurality of distinct residual blocks RB1-RBN, at least one weight has different values in the respective two distinct residual blocks. In some embodiments, both M and N are equal to 4. The residual block based network 1010 includes four residual block groups 1142, and each identical residual block group 1142 has four distinct residual blocks RB1-RB4. Alternatively, in some embodiments, M is equal to an integer that is greater than 4, and N is equal to 4.

[00118] Referring to Figure 11D, in some embodiments, the residual block based network 1010 corresponds to an SRRN (e.g., SRRN 710 in Figure 7), and includes an input interface 1010I, an output interface 1010O, and a plurality of residual block groups 1162 that are coupled in series between the input and output interfaces 1010I and 1010O. The input interface 1010I is coupled to the output interface 1010O via a skip connection 1164. Each residual block group 1162 is made of a plurality of distinct residual blocks RB1-RBN that are coupled in series according to a respective distinct order. For example, the residual block based network 1010 includes M residual block groups, where M is a positive integer greater than 1, and each of the plurality of residual block groups 1162 includes N distinct residual blocks, where N is a positive integer greater than 1. In some embodiments, both M and N are equal to 4. The residual block based network 1010 includes four residual block groups 1162, and each residual block group 1162 has four distinct residual blocks RB1-RB4. Alternatively, in some embodiments, M is equal to an integer that is greater than 4, and N is equal to 4.

[00119] In some embodiments, in the residual block groups 1162, each of the plurality of distinct residual blocks RB1-RBN has the same network structure as the residual block 1100. For every two of the plurality of distinct residual blocks RB1-RBN, at least one weight has different values in the respective two distinct residual blocks. Any two of the plurality of residual block groups 1162 include the same residual blocks RB1-RBN, but have different residual block orders. In an example, in two of the plurality of residual block groups 1162 (e.g., residual block groups 1162-1 and 1162-M), a pair of residual blocks RB1 and RB2 swap their positions, and remaining residual blocks are identical.
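One way to realize groups that reuse the same residual blocks in different orders, as described above for Figure 11D, is sketched below; the particular permutations are illustrative and would normally be chosen or trained separately.

```python
import torch
import torch.nn as nn

class ReorderedGroupNetwork(nn.Module):
    """Sketch of Figure 11D: M groups reuse the same N distinct blocks in different orders."""
    def __init__(self, blocks, orders):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)  # one shared set of distinct residual blocks
        self.orders = orders                 # one permutation of block indices per group

    def forward(self, x):
        out = x
        for order in self.orders:            # M residual block groups in series
            for index in order:              # N blocks per group, in the group's own order
                out = self.blocks[index](out)
        return x + out                       # skip connection from input to output interface
```

With four blocks and orders such as [0, 1, 2, 3] and [1, 0, 2, 3], two groups contain the same residual blocks RB1-RB4 but swap RB1 and RB2, mirroring the example in the paragraph above.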

[00120] Figure 12 is a flow diagram of an image processing process 1200 that increases an image resolution of a prior input image 1002P for ISR or VSR using a residual block based network 1010, in accordance with some embodiments. As explained above, a sequence of image frames includes a plurality of successive groups of pictures (GOPs). The same GOP includes both the current input image 1002C and the one or more prior input images 1002P that provide their corresponding output features 1006P to generate the current input feature 1008C. Before the current input image 1002C is processed, the one or more prior input images are processed to provide one or more prior output features. In some embodiments, the prior input image 1002P includes a first input image, and the prior output feature 1006P includes a first prior output feature. A second input image 1002P’ precedes the first input image 1002P. The residual block based network 1010 is applied to generate a second prior output feature 1006P’ based on the second input image 1002P’. The prior input image 1002P and second prior output feature 1006P’ are combined to generate a prior input feature 1008P. The residual block based network 1010 is applied to generate the first prior output feature 1006P from the prior input feature 1008P. The prior output feature 1006P is converted to the prior output image 1004P having the second resolution. The prior output image 1004P has the same image content as the prior input image 1002P, except that the prior output image 1004P has a higher resolution than the prior input image 1002P.

[00121] In some embodiments, the prior input image 1002P corresponds to an input image leading the sequence of image frames or leading a GOP, and no input image precedes the prior input image 1002P in the sequence of image frames or GOP. The electronic device creates an initial feature map 1202, and all elements of the initial feature map 1202 are equal to 0. The prior input image 1002P and the initial feature map 1202 are combined to generate a prior input feature 1008P. The prior output feature 1006P is generated from the prior input feature 1008P using the residual block based network 1010.
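The all-zero initial feature map for a frame that leads the sequence or a GOP could be created as follows; the 16-channel width is an assumption carried over from the earlier example configuration.

```python
import torch

def initial_feature_map(prior_input_image, num_feature_channels=16):
    """Create an all-zero feature map standing in for the missing prior output feature."""
    batch, _, height, width = prior_input_image.shape
    # All elements of the initial feature map are equal to 0.
    return torch.zeros(batch, num_feature_channels, height, width,
                       dtype=prior_input_image.dtype, device=prior_input_image.device)
```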

[00122] Figure 13 is a flow diagram of an image processing process 1300 that increases an image resolution of a next input image 1002N for ISR or VSR using a residual block based network 1010, in accordance with some embodiments. A next input image 1002N follows the current input image 1002C. In some situations, the current input image 1002C immediately precedes the next input image 1002N. Alternatively, in some situations, the current input image 1002C is separated from the next input image 1002N by one or more input images. The electronic device obtains the next input image 1002N that follows the current input image 1002C in the sequence of image frames (e.g., in the same GOP of the sequence of image frames). The next input image 1002N and the current output feature 1006C are combined to generate a next input feature 1008N. The residual block based network 1010 is applied to generate a next output feature 1006N based on the next input feature 1008N. The next output feature 1006N is converted to a next output image 1004N having the second resolution.

[00123] Figure 14 is a flow diagram of another example image processing process 1400 that increases an image resolution for ISR or VSR using a residual block based network 1410, in accordance with some embodiments. The image processing process 1400 is implemented by an electronic device (e.g., a data processing module 230 of a mobile phone 104C). The electronic device obtains a prior input image 1402P and a current input image 1402C that follows the prior input image 1402P in a sequence of image frames having a first resolution. In some situations, the prior input image 1402P immediately precedes the current input image 1402C. Alternatively, in some situations, the prior input image 1402P is separated from the current input image 1402C by one or more input images. For example, the two images have a predefined temporal separation. In some embodiments, a GOP of the sequence of image frames includes both the prior and current input images 1402P and 1402C.

[00124] The residual block based network 1410 is applied to generate a prior output image 1404P based on at least the prior input image 1402P, and the prior output image 1404P has a second resolution greater than the first resolution. A current optical flow map 1406C represents image motion between the prior and current input images 1402P and 1402C. A first output image 1408 is predicted from the prior output image 1404P and current optical flow map 1406C. Stated another way, the image motion is determined between the prior and current input images 1402P and 1402C, which have the first resolution, and is assumed to remain substantially unchanged after resolutions of the prior and current input images 1402P and 1402C increase. As such, the prior output image 1404P is shifted by this visual variation between the prior and current input images 1402P and 1402C to predict the first output image 1408.
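A common way to "shift" the prior output image by per-pixel motion is backward warping with a sampling grid, sketched below in PyTorch; the application does not prescribe a specific warping operator, and the sketch assumes the flow map has already been brought to the same spatial size as the prior output image, with displacements expressed in pixels.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(prior_output_image, flow):
    """Warp an (N, C, H, W) image by an (N, 2, H, W) optical flow map (pixel displacements)."""
    n, _, h, w = prior_output_image.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h, dtype=flow.dtype, device=flow.device),
                            torch.arange(w, dtype=flow.dtype, device=flow.device),
                            indexing="ij")
    x_new = xs.unsqueeze(0) + flow[:, 0]  # horizontal displacement
    y_new = ys.unsqueeze(0) + flow[:, 1]  # vertical displacement
    # Normalize the sampling positions to [-1, 1] as required by grid_sample.
    grid = torch.stack([2.0 * x_new / (w - 1) - 1.0,
                        2.0 * y_new / (h - 1) - 1.0], dim=-1)
    return F.grid_sample(prior_output_image, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```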

[00125] The first output image 1408 and the current input image 1402C are combined to generate a combined input image 1412. The electronic device applies the residual block based network 1410 to generate a current output feature 1414C based on the combined input image 1412, and converts the current output feature 1414C to a current output image 1404C having the second resolution. The current output image 1404C has the same image content as the current input image 1402C.

[00126] In some embodiments, an optical flow network 1416 is applied to generate the current optical flow map 1406C from the prior and current input images 1402P and 1402C. The current optical flow map 1406C includes a plurality of elements. In some embodiments, each element of the current optical flow map 1406C corresponds to a respective object and represents an object-based image motion value between a respective pixel of the prior input image 1402P and a corresponding pixel of the current input image 1402C. The current optical flow map 1406C has the first resolution. Alternatively, in some embodiments, each element of the current optical flow map 1406C corresponds to a respective object that is associated with a respective first set of neighboring pixels (e.g., 3x3 pixels) of the prior input image 1402P and a respective second set of neighboring pixels of the current input image 1402C. Each element of the current optical flow map 1406C represents an average image motion value of the respective object between the respective first set of neighboring pixels and the respective second set of neighboring pixels. In this case, the current optical flow map 1406C has a resolution smaller than the first resolution.

[00127] Additionally, in some embodiments, the optical flow network 1416 includes an encoder-decoder network. In an example, the encoder-decoder network is a U-net having skip connections. Alternatively, in another example, the encoder-decoder network is a U-net having no skip connection.

[00128] In some embodiments, the current optical flow map 1406 includes a first optical flow map. The prior output image 1404P was previously generated based on the prior input image 1402P, and has a resolution equal to the second resolution that is greater than the first resolution. Further, in some embodiments, a resolution of the first optical flow map is increased, e.g., from the first resolution, to the second resolution to generate a second optical flow map having the second resolution. The first output image 1408 is predicted from the prior output image 1404P and the second optical flow map, and is then down-sampled from the second resolution to the first resolution. The down-sampled first output image 1408 has the same resolution as, and is combined with, the current input image 1402C to generate the combined input image 1412. In an example, after resolutions of the first output image 1408 and current input image 1402 are normalized, the first output image 1408 and current input image 1402 are concatenated to generate the combined input image 1412.
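The resolution handling in this paragraph could look roughly like the sketch below, which reuses the warp_with_flow helper from the earlier sketch; scaling the flow values by the upsampling factor is an assumption, since the application only states that the flow map's resolution is increased.

```python
import torch.nn.functional as F

def predict_and_downsample(flow_lr, prior_output_image, scale=2):
    """Upsample the flow map, warp the prior output image at the second resolution,
    then downsample the predicted first output image back to the first resolution."""
    # Enlarge the first optical flow map to the second resolution; displacement values
    # are scaled by the same factor so they remain expressed in pixels (an assumption).
    flow_hr = F.interpolate(flow_lr, scale_factor=scale, mode="bilinear",
                            align_corners=False) * scale
    first_output_image = warp_with_flow(prior_output_image, flow_hr)
    # Downsample so the prediction can be concatenated with the current input image.
    return F.interpolate(first_output_image, scale_factor=1.0 / scale,
                         mode="bilinear", align_corners=False)
```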

[00129] The combined input image 1412 is processed by an image processing model 1418 that includes a feature extraction model 1420 and an output conversion model 1422 in addition to the residual block based network 1410. The feature extraction model 1420 is configured to extract an image feature map 1424 from the combined input image 1412. In an example, the feature extraction model 1420 includes a 3x3 convolution layer followed by one ReLU layer. The residual block based network 1410 converts the image feature map 1424 to a current output feature 1426. The output conversion model 1422 is coupled to an output of the residual block based network 1410, and configured to convert the current output feature 1426 generated by the residual block based network 1410 to the current output image 1404C. In the output conversion model 1422, a 3x3 convolution layer 1428 is followed by one ReLU layer to convert the current output feature 1426 to an intermediate feature map 1430, which is processed by a depth space model 1432 (also called a pixel shuffle layer) to generate the current output image 1404C having the second resolution. In some embodiments, the image feature map 1424 and intermediate feature map 1430 are combined on an element-by-element basis and processed by the depth space model 1432 to generate the current output image 1404C.

[00130] In some embodiments, the prior and current input images 1402P and 1402C are two distinct image frames of a video clip. Further, in some embodiments, the prior and current input images 1402P and 1402C belong to a GOP of the video clip. In some situations, the video clip is received via one or more communication networks 108 and in a user application 224, e.g., a social networking application, a social media application, a short video application, and a media play application. Examples of this user application 224 include, but are not limited to, TikTok, Kuaishou, WeChat, Tencent Video, iQiyi, and Youku. Given a limited signal transmission bandwidth, a server 102 associated with the user application 224 streams low-resolution visual data including the input image 502 to electronic devices distributed at different client nodes. In an example, the current input image 1402C is part of a low-resolution video stream provided to unpaid users of a media play application. VSR aims to improve video quality and the users’ watching experience by utilizing artificial intelligence. As such, the image processing process 1400 uses low-resolution information of the current input image 1402C to predict missing information of the current input image 1402C, which leads to a high-resolution video sequence including the current output image 1404C.

[00131] In some embodiments, quantization is applied to perform computation and store weights and biases of the optical flow network 1416 and image processing model 1418 at lower bit widths than a floating point precision. A quantized model executes some or all of the operations on the weights and biases with integers rather than floating point values. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms. In some embodiments, an image processing model 1418 includes the feature extraction model 1420, residual block based network 1410, and output conversion model 1422. The image processing model 1418 is quantized according to a precision setting of the electronic device where the image processing model 1418 will be loaded. For example, the electronic device is a mobile device having limited computational resources and has a lower precision than a floating point data format. Weights and biases of the image processing model 1418 are quantized based on the lower precision. The quantized image processing model 1418 can result in a significant accuracy drop and make image processing a lossy process. In some embodiments, the image processing model 1418 is retrained with the quantized weights and biases to minimize the loss function L. Likewise, the optical flow network 1416 is quantized and retrained based on a precision of an electronic device configured to run the image processing process 1400. Such quantization-aware training simulates low precision behavior in a forward pass, while a backward pass remains the same, inducing a quantization error that is accumulated in a loss function L.

[00132] In some embodiments, weights and biases associated with filters of the image processing model 1418 maintain a float32 format, and are quantized based on a precision setting of the electronic device. For example, the weights and biases are quantized from the float32 format to an int8, uint8, int16, or uint16 format based on the precision setting of the electronic device. Specifically, in an example, the electronic device uses a CPU to run the image processing model 1418, and the CPU of the electronic device processes 32 bit data. The weights and biases of the image processing model 1418 are not quantized, and the image processing model 1418 is provided to the electronic device directly. In another example, the electronic device uses one or more GPUs to run the image processing model 1418, and the GPU(s) process 16 bit data. The weights and biases of the image processing model 1418 are quantized to an int16 format. In yet another example, the electronic device uses a DSP to run the image processing model 1418, and the DSP processes 8 bit data. The weights and biases of the image processing model 1418 are quantized to an int8 format. After quantization of the weights and biases, e.g., to a fixed 8-bit format, the image processing model 1418 has fewer MAC operations and a smaller size, and is hardware-friendly during deployment on the electronic device.

[00133] In an example, weights and biases of an image processing model 1418 have a float32 format and are quantized to an uint8 format. Compared with the image processing model 1418 having the float32 format, the quantized image processing model 1418 only causes a small (e.g., negligible) loss on image information that is contained in the output image 1404C created by super-resolution. However, the quantized image processing model 1418 is executed within a duration of 20 milliseconds by a neural processing unit (NPU), and can be applied to process image frames of a video stream at a frame rate of 50 FPS.

[00134] The image processing model 1418 applied in the image processing process 1400 is limited by capabilities of the electronic device (e.g., a size of a random-access memory (RAM), computation resources, power consumption requirements, FLOPS of a system on chip (SoC) of a mobile phone). The architecture of the image processing model 1418 is designed according to the capabilities of the electronic device. In the present application, the image processing (i.e., VSR) process 1400 is designed based on hardware friendly operations, e.g., using 8-bit quantization aware training (QAT) in a YUV domain. As such, the image processing process 1400 is applicable in different image domains based on hardware capabilities.

[00135] In some embodiments, the current input image 1402 includes a raw image captured by an image sensor array. After converting the current output feature 1426 to the current output image 1404C, an ISP performs image processing operations on the current output image 1404C to generate an RGB color image. The image processing operations include one or more of demosaicing, denoising, and auto functions. Alternatively, in some embodiments, the image processing process 1400 is applied in an RGB domain having R, G, and B components, which correspond to red, green, and blue colors of a given pixel. The electronic device obtains a raw image captured by an image sensor array, and an ISP performs image processing operations on the raw image to generate the input image 1402 for ISR or VSR using the image processing process 1400. Alternatively and additionally, in some embodiments, the image processing process 1400 is applied in a YUV domain. A YUV color model defines a color space in terms of one luma component (Y) and two chrominance components including U (blue projection) and V (red projection). YUV encodes a color image or video taking human perception into account, allowing reduced bandwidth for chrominance components. A plurality of video devices, therefore, render directly using YUV or luminance/chrominance images.

[00136] Based on the image processing process 1400, real-time VSR is enabled efficiently on the electronic device in terms of runtime, model parameters, FLOPs, and power consumption. The image processing process 1400 is executed on many mobile devices with high performance, e.g., at a rate of 30 FPS, and particularly, outperforms state-of-the-art methods in most of the public datasets in terms of signal quality (e.g., measured in PSNR). The image processing model 1418 applied in the image processing process 1400 is robust to uint8 quantization and corresponds to a negligible PSNR drop when compared with a float32 model built on the DIV2K validation dataset. Moreover, VSR is implemented in the YUV domain, improving signal quality, structural similarity, and visual perception of the current input image 1402C and model inference abilities of the image processing model 1418.

[00137] In some embodiments, the image processing model 1418 includes a real-time raw VSR model, which is configured based on different operations of a meta-node latency on NPU. The raw VSR model includes three parts: a feature extraction model 1420 for shallow feature extraction, residual block based network 1410, and an upscale module including an output conversion model 1422. In an example, the feature extraction model 1420 includes a 3x3 convolution layer (pad 1, stride 2, and channel 16) followed by an ReLU layer, and is configured to extract and downsample shallow features. In the residual block based network 1410, a residual block includes four 3x3 convolution layers (pad 1, stride 1, and channel 16) followed by one ReLU layer, and is configured to extract mid-level and high-level features. The upscale module includes a 3x3 convolution layer (pad 1, stride 1, and channel 16) followed by a pixel shuffle layer. This raw VSR model is configured to convert low-resolution raw images (e.g., prior input image 1402P and current input image 1402C) and super-resolve the raw images with a factor (e.g., equal to 2). Such a real-time raw VSR can be implemented on a mobile device, and takes a shortened duration of time (e.g., 30ms) to convert an example raw image having a size of 3Mb and a resolution of 2000x1500 to another raw image having a size of 12Mb and a resolution of 4000x3000.

[00138] Figure 15 is a flow diagram of another example image processing process 1500 that increases an image resolution based on an optical flow map 1406, in accordance with some embodiments. A sequence of image frames includes a prior input image 1402P and a current input image 1402C that follows the prior input image 1402P. Both the prior and current input images 1402P and 1402C have a first resolution. A current optical flow map 1406C describes image motion (e.g., object-based image motion) between the prior and current input images 1402P and 1402C. In some embodiments, the current optical flow map 1406 is determined from the prior and current input images 1402P and 1402C using an optical flow network 1416 (e.g., a U-net). An electronic device applies a residual block based network 1410 to generate a prior output image 1404P based on the prior input image 1402P, and the prior output image 1404P has a second resolution greater than the first resolution. A first output image 1408 is predicted from the prior output image 1404P and a current optical flow map 1406C. The current input image 1402C and the first output image 1408 are combined to generate a combined input image, which is processed by an image processing model 1418 including the residual block based network 1410 to generate a current output feature 1426. The current output feature 1426 is converted to a current output image 1404C having the second resolution.

[00139] In some embodiments, the residual block based network 1410 includes an input interface, an output interface, and a plurality of distinct residual blocks 1502 that are coupled in series between the input and output interfaces, and the input interface is coupled to the output interface via a skip connection 1504. For example, each of the plurality of residual blocks 1502 includes an input interface, a first convolutional layer, an ReLU, a second convolution layer, an output interface, and a skip connection, and the skip connection couples the input interface to the output interface of the respective residual block 1502. Alternatively, in some embodiments not shown in Figure 15, the residual block based network 1410 includes an input interface, an output interface, and a plurality of identical residual block groups (e.g., residual block groups 1142 in Figure 11C) that are coupled in series between the input and output interfaces, and the input interface is coupled to the output interface via a skip connection to enable an element-wise sum. Each identical residual block group includes a plurality of distinct residual blocks that are coupled in series. Additionally and alternatively, in some embodiments not shown in Figure 15, the residual block based network 1410 includes an input interface, an output interface, and a plurality of residual block groups (e.g., residual block groups 1162-1 to 1162-M in Figure 11D) that are coupled in series between the input and output interfaces, and the input interface is coupled to the output interface via a skip connection. Each residual block group is made of a plurality of distinct residual blocks that are coupled in series according to a respective distinct order.

[00140] In optical photography, each pixel is a meta sample of a current input image 1402C. More samples typically provide more accurate representations of the current input image 1402C having the first resolution. While the electronic device uses a long-focus lens to obtain a high-resolution video in some embodiments, the range of the captured scene is usually limited by the size of the sensor array at the image plane. Thus, it is desirable for the electronic device to capture the wide-range scene at a lower resolution with a short-focus camera (e.g., a wide-angle lens), and then apply the single raw video super-resolution technique which recovers a high-resolution raw video from its low-resolution version.

[00141] Super-resolution is one of the most popular computer vision problems with many important applications to camera devices. In some embodiments, the image processing process 1500 is implemented in a raw image domain using optical flow-based motion compensation and a frame-recurrent strategy. An image processing model 1418 takes the current input image 1402C having the first resolution and the prior output image 1404P having the second resolution (i.e., a warped estimation of the prior input image 1402P) as inputs. A warping operation maps the prior output image 1404P to the current output image 1404C using optical flow information. In some embodiments, the optical flow network 1416 includes an encoder-decoder network (e.g., a U-Net structure) having no skip connection.

[00142] Figure 16 is a flow diagram of another example image processing process 1600 that increases an image resolution for ISR or VSR using a residual block based network 1410, in accordance with some embodiments. A current output image 1404C is generated from a current input image 1402C using the residual block based network 1410. A next input image 1402N follows the current input image 1402C in the sequence of image frames including the prior and current input images 1402P and 1402C. Optionally, the next input image 1402N immediately follows the current input image 1402C or is separated from the current input image 1402C by one or more images. A second output image 1602 is predicted from the current output image 1404C and a next optical flow map 1406N, which describes image motion between the current and next input images 1402C and 1402N. Specifically, the current output image 1404C is shifted by the next optical flow map 1406N to generate the second output image 1602. The next input image 1402N and the second output image 1602 are combined to generate a second combined input image 1604. In some embodiments, at least one resolution of the next input image 1402N and the second output image 1602 is adjusted, such that resolutions of the next input image 1402N and the second output image 1602 match each other and the next input image 1402N and the second output image 1602 can be concatenated to each other. The residual block based network 1410 is applied to generate a next output feature 1606 based on the second combined input image 1604. The next output feature 1606 is converted to a next output image 1404N having the second resolution.
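The frame-recurrent strategy described in paragraphs [00141] and [00142] can be summarized by a simple loop; in the sketch below, flow_network, sr_model, and warp_fn are placeholders for the optical flow network, the residual-block-based image processing model, and a warping operator, resolution matching is omitted for brevity, and the bilinear initialization for the first frame is an assumption.

```python
import torch
import torch.nn.functional as F

def frame_recurrent_vsr(frames, flow_network, sr_model, warp_fn, scale=2):
    """Sketch of a frame-recurrent loop: each output image is warped forward and reused."""
    outputs = []
    prev_input, prev_output = None, None
    for frame in frames:  # low-resolution frames in temporal order
        if prev_output is None:
            # No prior frame yet: a plain upsampled estimate stands in for the warped prediction.
            predicted = F.interpolate(frame, scale_factor=scale, mode="bilinear",
                                      align_corners=False)
        else:
            flow = flow_network(prev_input, frame)   # motion between prior and current input images
            predicted = warp_fn(prev_output, flow)   # shift the prior output image by that motion
        current_output = sr_model(frame, predicted)  # combine and super-resolve
        outputs.append(current_output)
        prev_input, prev_output = frame, current_output
    return outputs
```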

[00143] Figure 17 is a flow diagram of an example image processing method 1700 for improving image quality using an SRRN, in accordance with some embodiments. Figure 18 is a flow diagram of an example image processing method 1800 for improving image quality using an SRRN, in accordance with some embodiments. Figure 19 is a flow diagram of an example image processing method 1900 for improving image quality using an optical flow map, in accordance with some embodiments. For convenience, each of the image processing methods 1700, 1800, and 1900 is described as being implemented by an electronic system 200 (e.g., a mobile phone 104C for Figures 17-19). Each of the image processing methods 1700, 1800, and 1900 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figures 17-19 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in methods 1700, 1800, and 1900 may be combined and/or the order of some operations may be changed.

[00144] Referring to Figure 17, in some embodiments, an image processing method 1700 is implemented at an electronic device including one or more processors and memory. An electronic device obtains (1702) an input image 702 (Figure 7) having a first resolution and extracts (1704) an image feature map 714 from the input image 702. The image feature map 714 is processed (1706) with an SRRN 710 (Figure 7) to generate an output feature map 716. The SRRN 710 includes (1708) a first residual block group 710A and a second residual block group 710B coupled to the first residual block group 710A, and each of the first and second residual block groups 710A and 710B is made from a plurality of residual blocks (e.g., RB1-RB4 in Figures 7-8B). The electronic device further converts (1710) the output feature map 716 to an output image 704 (Figure 7) having a second resolution that is greater than the first resolution. The output image 704 describes the same image content as the input image 702. The plurality of residual blocks are coupled (1712) in series according to a first order and a second order to form the first residual block group 710A and the second residual block group 710B, respectively. The second order is distinct from the first order.

[00145] In some embodiments, the plurality of residual blocks of the first residual block group 710A having the first order are shifted (1714) circularly (e.g., counter-clockwise, clockwise) by one residual block to form the second residual block group 710B having the second order. In some embodiments, the plurality of residual blocks of the first residual block group 710A having the first order are shifted (1716) circularly (e.g., counter-clockwise, clockwise) by more than one residual block to form the second residual block group 710B having the second order.
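The circular shift of a residual block order can be expressed in a few lines; this is only an illustrative helper, not a detail of the application.

```python
from collections import deque

def circular_shift_order(first_order, shift=1):
    """Rotate a residual block order circularly to produce the second group's order."""
    rotated = deque(first_order)
    rotated.rotate(shift)  # e.g. ["RB1", "RB2", "RB3", "RB4"] -> ["RB4", "RB1", "RB2", "RB3"]
    return list(rotated)
```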

[00146] In some embodiments, each of the plurality of residual blocks has (1718) a respective first position in the first residual block group 710A and a respective second position in the second residual block group 710B. The respective first position is distinct from the respective second position.

[00147] In some embodiments, the first order and the second order are randomly determined (1720) for the first residual block group 710A and the second residual block group 710B.

[00148] In some embodiments, the SRRN 710 includes one or more residual block groups (e.g., 710C and 710D in Figures 8A-8B) that are coupled in series with each other and to the second residual block group 710B, and the plurality of residual blocks are coupled in series according to a respective order to form each of the one or more residual block groups.

[00149] In some embodiments, after setting the first and second orders for the first and second residual block groups 710A and 710B of the SRRN 710, a server or the electronic device trains the SRRN 710 to determine weights of each of the plurality of residual blocks. Further, in some embodiments, after the server trains the SRRN 710, the server provides the SRRN 710 to the electronic device.

[00150] In some embodiments, the SRRN 710 includes a third residual block group 710C and a fourth residual block group 710D. The first, second, third, and fourth residual block groups 710A-710D are coupled in series with each other. The plurality of residual blocks RB1-RB4 includes four residual blocks that are coupled in series according to a third order and a fourth order to form the third residual block group 710C and the fourth residual block group 710D, respectively. The first, second, third, and fourth orders are distinct from each other.

[00151] In some embodiments, referring to Figure 9, a plurality of residual block groups 710 includes the first residual block group 710A and the second residual block group 710B, and each residual block group has a distinct order of residual blocks. Distinct orders of residual blocks in the first and second residual block groups 710A and 710B are determined by generating a respective feature map 908A, ..., or 908N from a test image 906 for each residual block group 902A, ..., or 902N using the respective residual block group, determining that the first and second residual block groups 710A and 710B have a smallest similarity level among any two of the plurality of residual block groups 902A-902N, and selecting the first and second residual block groups 710A and 710B for the SRRN 710.

[00152] In some embodiments, referring to Figure 7, each of the plurality of residual blocks includes an input interface 722, a first convolutional layer 724, an ReLU 726, a second convolution layer 728, an output interface 730, and a skip connection 732, and the skip connection 732 couples the input interface 722 to the output interface 730.
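Referring back to the group-selection procedure in paragraph [00151], a sketch of how the least-similar pair of candidate residual block groups might be chosen is given below; cosine similarity over flattened feature maps is an assumed metric, since the application does not name a specific similarity measure.

```python
import itertools
import torch
import torch.nn.functional as F

def select_most_dissimilar_groups(candidate_groups, test_image):
    """Pick the pair of candidate groups whose feature maps on a test image are least similar."""
    feature_maps = [group(test_image).flatten(start_dim=1) for group in candidate_groups]
    best_pair, lowest_similarity = None, float("inf")
    for i, j in itertools.combinations(range(len(candidate_groups)), 2):
        similarity = F.cosine_similarity(feature_maps[i], feature_maps[j], dim=1).mean().item()
        if similarity < lowest_similarity:
            best_pair, lowest_similarity = (i, j), similarity
    return best_pair  # indices of the two groups with the smallest similarity level
```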

[00153] In some embodiments, the input image 702 includes (1722) an RGB color image. Before obtaining the input image 702, the electronic device obtains a raw image captured by an image sensor array, and performs, by an image signal processor (ISP), image processing operations on the raw image to generate the input image 702.

[00154] In some embodiments, the input image 702 includes (1724) a raw image captured by an image sensor array. After converting the output feature map 716 to the output image 704, the electronic device performs, by an image signal processor (ISP), image processing operations on the output image 704 to generate an RGB color image. The image processing operations include one or more of demosaicing, denoising, and auto functions.

[00155] In some embodiments, the input image 702 is obtained with a video clip including a sequence of image frames. The sequence of image frames includes the input image 702. Each image frame has the first resolution.

[00156] Referring to Figure 18, in some embodiments, an image processing method 1800 is implemented at an electronic device including one or more processors and memory. An electronic device obtains (1802) a prior input image 1002P (Figure 10) and a current input image 1002C (Figure 10) that follows the prior input image 1002P in a sequence of image frames having a first resolution. The electronic device applies (1804) a residual block based network 1010 (Figure 10) to generate a prior output feature 1006P based on the prior input image 1002P, and combines (1806) the current input image 1002C and the prior output feature 1006P to generate a current input feature 1008C. The residual block based network 1010 is applied (1808) to generate a current output feature 1006C based on the current input feature 1008C. The electronic device converts (1810) the current output feature 1006C to a current output image 1004C (Figure 10) having a second resolution. The second resolution is greater than the first resolution.

[00157] In some embodiments, the current input image 1002C and the prior output feature 1006P are combined by converting (1812) the first resolution of the current input image 1002C based on a resolution of the prior output feature 1006P generated based on the prior input image 1002P, concatenating (1814) the current input image 1002C and the prior output feature 1006P to generate a concatenated input image 1012, and extracting (1816) the current input feature 1008C from the concatenated input image 1012.

[00158] In some embodiments, the residual block based network 1010 (Figure 11B) includes (1818) an input interface 1010I, an output interface 1010O, and a plurality of distinct residual blocks that are coupled in series between the input and output interfaces 1010I and 1010O, and the input interface 1010I is coupled to the output interface 1010O via a skip connection 1122. In some embodiments, the residual block based network 1010 (Figure 11C) includes (1820) an input interface 1010I, an output interface 1010O, and a plurality of identical residual block groups 1142 that are coupled in series between the input and output interfaces 1010I and 1010O, and the input interface 1010I is coupled to the output interface 1010O via a skip connection 1144. Each identical residual block group 1142 includes a plurality of distinct residual blocks that are coupled in series. In some embodiments, the residual block based network 1010 (Figure 11D) includes (1822) an input interface 1010I, an output interface 1010O, and a plurality of residual block groups 1162 that are coupled in series between the input and output interfaces 1010I and 1010O, and the input interface 1010I is coupled to the output interface 1010O via a skip connection 1164. Each residual block group is made of a plurality of distinct residual blocks that are coupled in series according to a respective distinct order.

[00159] In some embodiments, the sequence of image frames is started with the prior input image 1002P. Alternatively, in some embodiments, the sequence of image frames includes a plurality of successive GOPs, and at least one GOP is started with the prior input image 1002P. The electronic device creates an initial feature map 1202 (Figure 12), and all elements of the initial feature map 1202 are equal to 0. The electronic device combines the prior input image 1002P and the initial feature map 1202 to generate a prior input feature 1008P, and generates the prior output feature 1006P from the prior input feature 1008P using the residual block based network 1010.

[00160] In some embodiments, the prior input image 1002P includes a first input image, and the prior output feature 1006P includes a first prior output feature. The electronic device obtains a second input image 1002P’. The first input image 1002P follows the second input image 1002P’. The residual block based network 1010 is applied to generate a second prior output feature 1006P’ based on the second input image 1002P’. The current input image 1002C is combined with both the first and second prior output features 1006P and 1006P’ to generate the current input feature 1008C.

[00161] In some embodiments, the current input image 1002C immediately follows (1824) the prior input image 1002P in the sequence of image frames.

[00162] In some embodiments, each image of the sequence of image frames includes a respective RGB color image. Before obtaining the prior and current input images 1002P and 1002C, the electronic device obtains a sequence of raw images including a prior raw image and a current raw image. The sequence of raw images is captured by an image sensor array. The electronic device performs, by an image signal processor (ISP), image processing operations on the prior and current raw images to generate the prior and current input images 1002P and 1002C, respectively.

[00163] In some embodiments, each image of the sequence of image frames includes a respective raw image captured by an image sensor array. After converting the current output feature 1006C to the current output image 1004C, an ISP of the electronic device performs image processing operations on the current output image 1004C to generate a current RGB color image.

[00164] In some embodiments, the electronic device obtains a next input image 1002N (Figure 13) that follows the current input image 1002C in the sequence of image frames. The next input image 1002N and the current output feature 1006C are combined to generate a next input feature 1008N. The residual block based network 1010 is applied to generate a next output feature 1006N based on the next input feature 1008N. The electronic device converts the next output feature 1006N to a next output image 1004N having the second resolution.

[00165] In some embodiments, the residual block based network 1010 has a plurality of layers and includes a plurality of weights associated with a respective number of filters of each layer. The plurality of weights are quantized to an int8, uint8, int16, or uint16 format based on a precision setting of an electronic device.

[00166] Referring to Figure 19, in some embodiments, an image processing method 1900 is implemented at an electronic device including one or more processors and memory. An electronic device obtains (1902) a prior input image 1402P (Figure 14) and a current input image 1402C (Figure 14) that follows the prior input image 1402P in a sequence of image frames having a first resolution, and applies (1904) a residual block based network 1410 (Figure 14) to generate a prior output image 1404P based on the prior input image 1402P. The prior output image 1404P has a second resolution greater than the first resolution. The electronic device predicts (1906) a first output image 1408 from the prior output image 1404P and a current optical flow map 1406C, and the current optical flow map 1406C describes image motion between the prior and current input images 1402P and 1402C. The current input image 1402C and the first output image 1408 are combined (1908) to generate a combined input image 1412. The electronic device applies (1910) the residual block based network 1410 to generate a current output feature 1426 based on the combined input image 1412, and converts (1912) the current output feature 1426 to a current output image 1404C (Figure 14) having the second resolution.

[00167] In some embodiments, the electronic device applies (1914) an optical flow network 1416 to generate the current optical flow map 1406C from the prior and current input images 1402P and 1402C. The current optical flow map 1406C (1916) includes a plurality of elements, and each element represents an image motion value of a corresponding object between one or more respective pixels of the prior input image 1402P and one or more respective pixels of the current input image 1402C. Further, in some embodiments, the current optical flow map 1406C includes a first optical flow map. The electronic device increases (1918) a resolution of the first optical flow map to the second resolution to generate a second optical flow map having the second resolution, and down-samples (1920) the first output image 1408 from the second resolution to the first resolution. The first output image 1408 is predicted from the prior output image 1404P and the second optical flow map, and the down-sampled first output image 1408 is combined with the current input image 1402C. [00168] Additionally, in some embodiments, the optical flow network 1416 includes an encoder-decoder network (e.g., a U-net having no skip connection).

[00169] In some embodiments, the current input image 1402C and the first output image 1408 are combined by concatenating the current input image 1402C and the first output image 1408 to generate the combined input image 1412.

[00170] In some embodiments, the residual block based network 1410 includes an input interface, an output interface, and a plurality of distinct residual blocks 1502 (Figure 15) that are coupled in series between the input and output interfaces, and the input interface is coupled to the output interface via a skip connection 1504.

[00171] In some embodiments, the residual block based network 1410 includes an input interface, an output interface, and a plurality of identical residual block groups that are coupled in series between the input and output interfaces, and the input interface is coupled to the output interface via a skip connection. Each identical residual block group includes a plurality of distinct residual blocks that are coupled in series.

[00172] In some embodiments, the residual block based network 1410 includes an input interface, an output interface, and a plurality of residual block groups that are coupled in series between the input and output interfaces, and the input interface is coupled to the output interface via a skip connection. Each residual block group is made of a plurality of distinct residual blocks that are coupled in series according to a respective distinct order. [00173] In some embodiments, each image of the sequence of image frames includes a respective RGB color image. Before obtaining the prior and current input images 1402P and 1402C, the electronic device obtains a sequence of raw images including a prior raw image and a current raw image, and the sequence of raw images is captured by an image sensor array. An image signal processor (ISP) performs image processing operations on the prior and current raw images to generate the prior and current input images 1402P and 1402C, respectively.

[00174] In some embodiments, each image of the sequence of image frames includes a respective raw image captured by an image sensor array. After converting the current output feature 1426 to the current output image 1404C, the electronic device performs, by an ISP, image processing operations on the current output image 1404C to generate a current RGB color image.
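To make the two pipeline orderings of paragraphs [00173] and [00174] concrete, the sketch below contrasts them; isp and video_sr are placeholder callables standing in for the image signal processor and the super-resolution model, and are assumptions of this example.

```python
def sr_after_isp(prior_raw, current_raw, isp, video_sr):
    """Ordering of paragraph [00173]: the ISP first converts raw sensor
    frames to RGB, and super-resolution then operates on the RGB images."""
    prior_rgb = isp(prior_raw)
    current_rgb = isp(current_raw)
    return video_sr(prior_rgb, current_rgb)


def sr_before_isp(prior_raw, current_raw, isp, video_sr):
    """Ordering of paragraph [00174]: super-resolution operates directly on
    the raw frames, and the ISP then converts the high-resolution raw output
    to an RGB color image."""
    current_raw_sr = video_sr(prior_raw, current_raw)
    return isp(current_raw_sr)
```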

[00175] In some embodiments, the current input image 1402C immediately follows the prior input image 1402P in the sequence of image frames. Alternatively, in some embodiments, the current input image 1402C follows the prior input image 1402P in the sequence of image frames and is separated from the prior input image 1402P by a predefined number of image frames.

[00176] In some embodiments, the electronic device obtains a next input image 1402N (Figure 16) that follows the current input image 1402C in the sequence of image frames and predicts a second output image 1602 from the current output image 1404C and a next optical flow map 1406N. The next optical flow map 1406N describes image motion between the current and next input images 1402C and 1402N. The electronic device combines the next input image 1402N and the second output image 1602 to generate a second combined input image 1604, applies the residual block based network 1410 to generate a next output feature based on the second combined input image 1604, and converts the next output feature to a next output image 1404N having the second resolution.
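Paragraph [00176] applies the same steps to the next input image, which suggests a recurrent loop over the frame sequence. The sketch below shows one possible such loop; sr_step is a placeholder per-frame routine (for example, the super_resolve_frame sketch above, closed over the flow and residual block based networks), and the bilinear bootstrap for the very first frame is an assumption of this sketch only.

```python
import torch.nn.functional as F


def super_resolve_sequence(frames, sr_step, scale=4):
    """Illustrative recurrent loop over a video: each current output image is
    carried forward as the prior output image for the next frame.

    frames: list of low-resolution frames shaped (N, C, H, W).
    sr_step: placeholder callable taking (prior_lr, current_lr, prior_sr).
    """
    outputs = []
    prior_lr = prior_sr = None
    for current_lr in frames:
        if prior_sr is None:
            # Bootstrapping choice for the first frame (an assumption of this
            # sketch, not a requirement of the described method).
            prior_lr = current_lr
            prior_sr = F.interpolate(current_lr, scale_factor=scale,
                                     mode="bilinear", align_corners=False)
        current_sr = sr_step(prior_lr, current_lr, prior_sr)
        outputs.append(current_sr)
        prior_lr, prior_sr = current_lr, current_sr
    return outputs
```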

[00177] It should be understood that the particular order in which the operations in each of the image processing methods 1700, 1800, and 1900 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to process image data. Additionally, it should be noted that details of other processes described above with respect to Figures 1-16 are also applicable in an analogous manner to each of the image processing methods 1700, 1800, and 1900 described above with respect to Figures 17-19. For brevity, these details are not repeated here.

[00178] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.

[00179] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.

[00180] The foregoing description, for purposes of explanation, has been provided with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art to make use of the described embodiments.

[00181] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.