

Title:
METHODS AND SYSTEMS FOR LOW LIGHT VIDEO ENHANCEMENT
Document Type and Number:
WIPO Patent Application WO/2023/177388
Kind Code:
A1
Abstract:
This application is directed to image processing using deep learning techniques. An electronic device obtains an input image and associated image metadata including an image characteristic related to a brightness level of the input image, generates a noise map from the input image using a sequence of image processing networks, and outputs an output image generated from the input image and the noise map. One or more intermediate feature maps are generated from the sequence of image processing networks. At least one intermediate feature map is modified based on the image characteristic in the image metadata of the input image. In some embodiments, the image characteristic is normalized and used to scale each element of the intermediate feature map that is modified. The scaled intermediate feature map is combined with the first feature map and provided to a next image processing network for further processing.

Inventors:
DING JIAMING (US)
MENG ZIBO (US)
SHEN JINGLIN (US)
Application Number:
PCT/US2022/020212
Publication Date:
September 21, 2023
Filing Date:
March 14, 2022
Assignee:
INNOPEAK TECH INC (US)
International Classes:
G06N3/02; G06N3/08; G06T3/40
Domestic Patent References:
WO2020252764A1 (2020-12-24)
WO2021177758A1 (2021-09-10)
Foreign References:
US20160210730A1 (2016-07-21)
US20200175352A1 (2020-06-04)
Attorney, Agent or Firm:
WANG, Jianbai et al. (US)
Claims:
What is claimed is:

1. An image processing method, implemented at an electronic device having memory, comprising: obtaining an input image and associated image metadata, the associated image metadata including an image characteristic related to a brightness level of the input image; generating a noise map from the input image using a sequence of image processing networks, including: generating one or more intermediate feature maps from the sequence of image processing networks, the one or more intermediate feature maps distinct from the noise map and including a first feature map; and modifying the first feature map based on the image characteristic; and generating an output image from the input image and the noise map.

2. The method of claim 1, further comprising: determining a denoising level from the image characteristic; and adjusting the noise map using the denoising level, wherein the adjusted noise map is used to generate the output image.

3. The method of claim 1 or 2, wherein modifying the first feature map based on the image characteristic further comprises: normalizing the image characteristic; scaling each element of the first feature map using the normalized image characteristic to generate a scaled feature map; and combining the first feature map and the scaled feature map to generate a combined feature map, wherein the combined feature map is provided to and processed by a next image processing network.

4. The method of claim 3, wherein combining the first feature map and the scaled feature map further comprises: concatenating the first feature map and the scaled feature map in a channel dimension to generate a concatenated feature map; and processing the concatenated feature map with a respective convolutional neural network to generate the combined feature map.

5. The method of any of claims 1-4, wherein: the sequence of image processing networks includes a first network, a second network that follows the first network, and a third network that follows the second network; the one or more intermediate feature maps further include a second feature map; the first and second feature maps are generated from the first and second networks, respectively; and the method further comprises: modifying the second feature map based on the image characteristic; and processing the modified second feature map with the third network.

6. The method of any of claims 1-5, wherein the sequence of image processing networks includes a first network and a second network that follows the first network, and the first feature map is generated from the first network, wherein the one or more intermediate feature maps include a second feature map, the method further comprising: receiving the modified first feature map by the second network; and generating the second feature map from the modified first feature map using the second network.

7. The method of claim 6, wherein: the first network includes a first convolutional neural network that receives the input image and generates the first feature map; and the second network includes at least one of a convolutional neural network and an encoder-decoder network.

8. The method of claim 7, wherein the second network includes the encoder-decoder network and a convolutional neural network that are arranged in an ordered sequence, and the noise map is generated from the second network.

9. The method of any of claims 1-5, wherein the sequence of image processing networks includes an encoder-decoder network, and the encoder-decoder network further includes a series of encoding stages, a series of decoding stages, and a bottleneck network coupled between the series of encoding stages and the series of decoding stages, and wherein the first feature map is generated by one of the encoding stages, decoding stages, and bottleneck network.

10. The method of any of the preceding claims, further comprising: determining a pixel-wise L1-norm loss based on the output image and a ground truth image of the input image; determining an edge loss between a first edge image and a second edge image, the first and second edge images including edge information of the output and ground truth images, respectively; determining a content loss between a first semantic map extracted from the output image and a second semantic map extracted from the ground truth image; determining a comprehensive loss combining the pixel-wise L1-norm loss, edge loss, and content loss in a weighted manner; and training the sequence of image processing networks using the comprehensive loss.
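The weighted combination of losses recited in claim 10 can be sketched as follows. The weight values, the two-element toy inputs, and the use of mean absolute difference for the L1 term are illustrative assumptions, not values taken from the application.

```python
# Hedged sketch of the comprehensive loss in claim 10. The weights
# (w_pixel, w_edge, w_content) are hypothetical placeholders; the
# edge and content losses are passed in as precomputed scalars.

def l1_loss(output, truth):
    """Pixel-wise L1-norm loss: mean absolute difference over pixels."""
    return sum(abs(o - t) for o, t in zip(output, truth)) / len(output)

def comprehensive_loss(pixel_loss, edge_loss, content_loss,
                       w_pixel=1.0, w_edge=0.1, w_content=0.01):
    """Combine the three losses in a weighted manner, per claim 10."""
    return w_pixel * pixel_loss + w_edge * edge_loss + w_content * content_loss

# Toy example: tiny "images" with two pixels each.
pixel = l1_loss([0.5, 0.2], [0.4, 0.4])   # -> 0.15
total = comprehensive_loss(pixel, edge_loss=0.3, content_loss=0.2)
```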

11. The method of any of the preceding claims, wherein each of the sequence of image processing networks includes a plurality of weights associated with a respective plurality of filters of each layer, the method further comprising: quantizing the plurality of weights based on a data format, including maintaining the data format for the plurality of weights while training the sequence of image processing networks using a predefined loss function.

12. The method of claim 11, wherein the data format of the plurality of weights is selected based on a precision setting of an electronic device, the method further comprising: providing the sequence of image processing networks including the quantized weights to the electronic device.

13. The method of claim 11, wherein the data format is selected from float32, int8, uint8, int16, and uint16.

14. The method of any of the preceding claims, wherein the image characteristic includes an ISO value.

15. An electronic device, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-14.

16. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform a method of any of claims 1-14.

Description:
Methods and Systems for Low Light Video Enhancement

TECHNICAL FIELD

[0001] This application relates generally to image processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for enhancing an image using deep learning techniques.

BACKGROUND

[0002] Images captured under low illumination conditions typically have a low signal-to-noise ratio (SNR) and poor perceptual quality. Exposure times can be extended to improve image quality; however, the resulting images can become blurry. Image optimization schemes (e.g., partial differential equations (PDE), domain transformation, nonlocal patching) have been explored to remove image noise caused by low illumination. These optimization schemes usually involve time-consuming iterative inference and rely heavily on tuning parameters, demanding a large amount of computational resources. Convolutional neural networks (CNNs) have also been applied in image enhancement and noise reduction and provide better denoised results than the image optimization schemes. However, many CNNs demand so many computational resources that they cannot be implemented on a mobile device, particularly when a CNN runs alongside an image signal processor (ISP). It would be beneficial to have an effective and efficient mechanism that improves image quality and removes image noise, e.g., for images captured under low illumination conditions, without demanding a large amount of computational resources.

SUMMARY

[0003] Various embodiments of this application are directed to removing noise and improving a visual quality of low light visual content (e.g., images and video) based on deep learning techniques. Cameras may acquire noisy images under low lighting conditions because insufficient exposure times are used, e.g., to meet a real-time requirement. In this application, deep learning techniques are applied in a YUV image domain to remove noise in visual content in real time. The deep learning techniques rely on a sequence of neural networks, and at least one intermediate feature map generated by the neural networks is modified based on an image characteristic of the visual content. The image characteristic may also be applied to adjust a denoising strength or level, thereby producing brighter, lower-noise, high-quality visual content. An example of the image characteristic is a sensitivity of an image sensor of a camera represented by ISO settings. Further, quantization aware training (QAT) is applied to train the sequence of neural networks according to precision settings of a mobile device, such that the trained neural networks may be implemented by the mobile device for real-time denoising of the visual content. By these means, the image characteristic that is provided with visual content is applied jointly with neural networks, e.g., in a mobile device, to improve image quality and remove image noise, particularly for images captured under low illumination conditions.
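The quantization aware training mentioned above can be illustrated with a minimal fake-quantization step applied to a layer's weights. The symmetric per-tensor int8 scaling shown here is a common QAT convention and an assumption on our part; the application does not fix a particular quantization scheme.

```python
# Hedged sketch of weight quantization as used in QAT: round float
# weights onto a symmetric int8 grid, then map them back to float so
# the quantization error is visible during training.

def fake_quantize_int8(weights):
    """Fake-quantize a list of float weights to an int8 grid."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0            # symmetric range: [-127, 127] steps
    # Round to the nearest grid point, then dequantize back to float.
    return [round(w / scale) * scale for w in weights]

q = fake_quantize_int8([0.0, 1.27, -0.635])
```

During QAT the forward pass uses these dequantized weights while gradients update the underlying float weights, so the network learns to tolerate the chosen data format (claim 11).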

[0004] In one aspect, an image processing method is implemented at an electronic device having memory. The method includes obtaining an input image and associated image metadata, and the associated image metadata includes an image characteristic related to a brightness level of the input image. The method further includes generating a noise map from the input image using a sequence of image processing networks, including generating one or more intermediate feature maps from the sequence of image processing networks. The one or more intermediate feature maps are distinct from the noise map and include a first feature map. Generating the noise map further includes modifying the first feature map based on the image characteristic. The method further includes generating an output image from the input image and the noise map. In an example, the image characteristic includes an ISO value. In some embodiments, the method includes determining a denoising level from the image characteristic and adjusting the noise map using the denoising level, wherein the adjusted noise map is used to generate the output image.
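A minimal sketch of this flow follows, with the image processing networks stubbed out as a given noise map. The linear ISO-to-level mapping and the residual subtraction that forms the output image are illustrative assumptions; the application does not prescribe either.

```python
# Hedged sketch of the method in paragraph [0004]: an ISO-derived
# denoising level scales the predicted noise map, and the output image
# is formed from the input and the adjusted noise map. The residual
# subtraction and the iso_max normalization constant are assumptions.

def denoising_level(iso, iso_max=6400.0):
    """Hypothetical mapping: higher ISO (darker scene) -> stronger denoising."""
    return min(iso / iso_max, 1.0)

def enhance(input_pixels, noise_map, iso):
    """Adjust the noise map by the denoising level, then form the output."""
    level = denoising_level(iso)
    adjusted = [level * n for n in noise_map]            # adjusted noise map
    return [p - a for p, a in zip(input_pixels, adjusted)]  # output image

out = enhance([0.5, 0.6], noise_map=[0.1, 0.2], iso=3200)
```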

[0005] In some embodiments, modifying the first feature map further includes normalizing the image characteristic, scaling each element of the first feature map using the normalized image characteristic to generate a scaled feature map, and combining the first feature map and the scaled feature map to generate a combined feature map. The combined feature map is provided to and processed by a next image processing network.
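The normalize, scale, and combine steps of this paragraph can be sketched on a toy one-dimensional feature map. The per-position averaging that stands in for the convolutional fusion network of claim 4, and the iso_max constant, are purely illustrative assumptions.

```python
# Hedged sketch of modifying the first feature map per paragraph [0005]:
# normalize the image characteristic, scale each element, concatenate
# the original and scaled maps in the channel dimension, and fuse them.
# A real implementation would fuse with a convolutional network; the
# averaging below is a stand-in for illustration only.

def modify_feature_map(feature_map, iso, iso_max=6400.0):
    norm = iso / iso_max                        # normalize the characteristic
    scaled = [norm * x for x in feature_map]    # scale every element
    concat = [feature_map, scaled]              # "channel" concatenation
    # Stand-in for the convolutional fusion network of claim 4:
    return [0.5 * (a + b) for a, b in zip(*concat)]

combined = modify_feature_map([1.0, 2.0], iso=3200)  # fed to the next network
```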

[0006] In another aspect, some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.

[0007] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.

[0008] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof.

Additional embodiments are discussed in the Detailed Description, and further description is provided there.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.

[0010] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.

[0011] Figure 2 is a block diagram illustrating an electronic system configured to process content data (e.g., image data), in accordance with some embodiments.

[0012] Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.

[0013] Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments, and Figure 4B is an example node in the neural network, in accordance with some embodiments.

[0014] Figure 5 is a block diagram of an example image processing system, in accordance with some embodiments.

[0015] Figure 6 is a block diagram of a low light processing module applied in an electronic device, in accordance with some embodiments.

[0016] Figure 7 is a block diagram of another example low light processing module that processes an input image using a sequence of image processing networks, in accordance with some embodiments.

[0017] Figure 8A is a block diagram of an example encoder-decoder network (e.g., a U-net) in which at least one intermediate feature map is modified based on an image characteristic of an input image, in accordance with some embodiments.

[0018] Figure 8B is a flow diagram of an example process of modifying an intermediate feature map based on an image characteristic of an input image, in accordance with some embodiments.

[0019] Figure 9 is a flow diagram of an example image processing method, in accordance with some embodiments.

[0020] Like reference numerals refer to corresponding parts throughout the several views of the drawings.

DETAILED DESCRIPTION

[0021] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.

[0022] An electronic device employs an image characteristic of visual content to adjust one or more intermediate feature maps generated by a sequence of image processing networks, thereby improving a perceptual quality and preserving visual details of the visual content taken under low illumination conditions. The image characteristic optionally includes ISO settings that define a sensitivity and a signal gain of an image sensor array of a camera. The sequence of image processing networks optionally includes a convolutional neural network (CNN). In some embodiments, a denoising strength determines a strength of a denoising effect and is adjusted based on the image characteristic. The darker an environment is, the stronger the denoising strength; a particularly strong denoising strength is applied in an extremely dark environment. In some embodiments, this image enhancement capability is integrated in a user application of an electronic device (e.g., a camera application). When the visual content is recorded or previewed, the user application optionally provides a user-selectable affordance item, allowing the electronic device to implement the sequence of image processing networks to process the visual content based on the image characteristic in response to a user selection of the affordance item.
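The darker-scene, stronger-denoising relationship described above might be realized with a simple monotone mapping such as the one below. The breakpoints, the clamped range, and the use of mean luma as the darkness measure are hypothetical choices for illustration; the application only requires that strength increase as the environment darkens.

```python
# Illustrative monotone mapping from scene brightness to denoising
# strength: darker scenes get a stronger strength, clamped to [0, 1].
# All numeric breakpoints here are hypothetical.

def denoising_strength(mean_luma):
    """mean_luma in [0, 1] (0 = black); returns strength in [0, 1]."""
    if mean_luma >= 0.5:        # bright scene: light denoising
        return 0.1
    if mean_luma <= 0.05:       # extremely dark: strongest denoising
        return 1.0
    # linear ramp between the two regimes
    return 1.0 + (mean_luma - 0.05) * (0.1 - 1.0) / (0.5 - 0.05)

s = denoising_strength(0.05)    # extremely dark scene -> strongest setting
```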

[0023] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted displays (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.

[0024] The one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C. The networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E remotely and in real time.

[0025] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables.
Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages.

[0026] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. In these deep learning techniques, data processing models (e.g., an encoder-decoder network 800 in Figure 8A) are created based on one or more neural networks to process the content data. These data processing models are trained with training data, e.g., in a server 102, before they are applied to process the content data in either the server 102 or a client device 104. In an example, subsequent to model training, the mobile phone 104C or HMD 104D obtains the data processing models and processes content data using the data processing models locally.

[0027] In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D). The client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A and HMD 104D). The server 102A obtains the training data from itself, another server 102, or the storage 106, and applies the training data to train the data processing models. The client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results. The client device 104 itself implements little or no data processing on the content data prior to sending it to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104. The server 102B obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106.
The client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.

[0028] In some embodiments, a pair of AR glasses 104D (also called an HMD) are communicatively coupled in the data processing environment 100. The AR glasses 104D include a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the AR glasses 104D, and the hand gestures are recognized locally and in real time using a two-stage hand gesture recognition model. In some situations, the microphone records ambient sound, including a user's voice commands. In some situations, both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses. Optionally, deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses. The device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D. In some embodiments, the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.

[0029] As explained above, in some embodiments, deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by the AR glasses 104D. 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model. Visual content is optionally generated using a second data processing model. Training of the first and second data processing models is optionally implemented by the server 102 or AR glasses 104D. Inference of the device poses and visual content is implemented by each of the server 102 and AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.

[0030] Figure 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments. The electronic system 200 includes a server 102, a client device 104 (e.g., a mobile phone 104C in Figure 1), a storage 106, or a combination thereof. The electronic system 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a camera 260, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera for gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more optical cameras (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.

[0031] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:

• Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;

• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;

• User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);

• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;

• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;

• One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);

• Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;

• Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 includes a sequence of image processing networks that generates a noise map 708 from an input image 702 to reduce a noise level of the input image 702; and

• One or more databases 230 for storing at least data including one or more of:
 o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
 o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
 o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server, and host name;
 o Training data 238 for training one or more data processing models 250;
 o Data processing model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where in some embodiments, the data processing models 240 include an encoder-decoder network 800 (e.g., a U-net) in Figure 8A or a sequence of image processing networks 704 in Figure 7; and
 o Content data and results 254 that are obtained by and outputted to the client device 104 of the electronic system 200, respectively, where the content data is processed by the data processing models 250 locally at the client device 104 or remotely at the server 102 to provide the associated results to be presented on the client device 104.

[0032] Optionally, the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200. Optionally, the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively.

[0033] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.

[0034] Figure 3 is an example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300. The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.

[0035] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 240 is trained according to a type of the content data to be processed. The training data 306 is consistent with the type of the content data, and so is the data pre-processing module 308 that is applied to process the training data 306. For example, an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 240 is provided to the data processing module 228 to process the content data.

[0036] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.

[0037] The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.

[0038] Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments, and Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments. The data processing model 240 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs. As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. In an example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
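As an illustration of the propagation function described above, the following minimal Python sketch (a hypothetical helper, not part of the application) applies a non-linear activation to the linear weighted combination of a node's inputs; the sigmoid activation and the explicit bias term b are assumptions chosen for concreteness.

```python
import math

def node_output(inputs, weights, bias):
    """Propagation function of one node 420: a sigmoid activation applied to
    the weighted combination of the node inputs plus a bias term b.
    Both the sigmoid choice and the helper name are illustrative assumptions."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

With zero weights and zero bias the sigmoid output is 0.5; a negative weighted sum pushes the output toward 0, and a positive one toward 1.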

[0039] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the one or more layers include a single layer acting as both an input layer and an output layer. Optionally, the one or more layers include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for downsampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
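The max pooling behavior described above, where a node of the following layer takes the maximum value of the two or more nodes connected to it, can be sketched as a hypothetical one-dimensional helper:

```python
def max_pool_1d(values, window=2):
    """Down-sample a row of node outputs by taking the maximum of each
    group of `window` adjacent nodes, as in the max pooling connection
    between layers 404B and 406 described above (helper is illustrative)."""
    return [max(values[i:i + window]) for i in range(0, len(values), window)]
```

For example, pooling four node outputs with a window of 2 halves the layer width while keeping the strongest activation in each pair.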

[0040] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 240 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolutional layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, imprinting, or synthesis.

[0041] Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. In an example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition.
It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.

[0042] The training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set which is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data. The result of the training includes the network bias parameter b for each layer.

[0043] Figure 5 is a block diagram of an example image processing system 500 applied in an electronic device having a camera 260, in accordance with some embodiments. An example of the electronic device is a mobile phone 104C. The electronic device executes a camera application 502 configured to generate visual content 504 (e.g., static images, video content including a sequence of images) and stores the visual content 504 in an album 506. The camera application 502 includes a record and review front module 508, an image processing module 510, and a media store 512. The record and review front module 508 obtains visual data 514 from the camera 260 or from a user application (e.g., the album 506). The visual data 514 is captured by the camera 260 of the electronic device or provided by the user application to be previewed on the electronic device. The image processing module 510 applies image processing operations (e.g., compression, noise filtering, color correction) on the visual data 514 to generate visual data 524. The media store 512 encodes the visual data 524 to the visual content 504, temporarily stores the visual content 504, and provides the visual content 504 for storage in the album 506.

[0044] The image processing module 510 includes an image signal processing (ISP) module 516 and a low light processing module 518. The ISP module 516 receives the visual data 514 that includes raw images, and converts the visual data 514 to a digital format while performing image processing operations (e.g., compression, noise filtering, color correction) on the visual data 514. Converted visual data 520 optionally has a YUV image format and is generated with metadata 522 including one or more image characteristics of the visual data 520. Both the visual data 520 and the metadata 522 are provided to the low light processing module 518. The low light processing module 518 is configured to reduce a noise level of the visual data 520 based on the metadata 522 and generate the visual content 504. In an example, an image characteristic in the metadata 522 is associated with a brightness level of the visual data 514, and includes an ISO value that determines a light sensitivity of image sensors of the camera 260.

[0045] In some embodiments, the image processing module 510 is implemented in an Android hardware abstraction layer (HAL) and defines a standard interface that image services call into. Based on the image processing module 510, the camera application 502 is required to ensure that image sensors of the camera 260 function correctly. Stated another way, when the visual content 504 is recorded or previewed by way of the camera application 502, the visual content 504 is processed on the HAL. The low light processing module 518 is inserted between the ISP module 516 and the media store 512, and configured to intercept the visual data 520 and associated metadata 522, process the visual data 520 based on the metadata 522, and return the resulting visual data 524 to the media store 512. The media store 512 encodes the visual data 524 for preview or storage.

[0046] Figure 6 is a block diagram of an example low light processing module 518 applied in an electronic device, in accordance with some embodiments. The low light processing module 518 is applied in addition to the ISP module 516 to improve an image quality of each image in the visual content 504, e.g., when the visual content 504 is captured under low lighting conditions. In some situations, an ISO value, an exposure time, or both are increased under a low lighting condition, which may compromise details in images of the visual content 504. The low light processing module 518 is implemented based on deep learning techniques, e.g., based on a sequence of image processing networks. An image characteristic of the visual data 514 associated with a brightness level of the visual data 514 reflects a lighting condition of the visual data 514, and therefore, is applied to modify one or more intermediate feature maps generated by the sequence of image processing networks. By these means, an output of the sequence of image processing networks takes into account and counteracts an impact of the image characteristic of the visual data 514 associated with the brightness level of the visual data 514.

[0047] The low light processing module 518 includes a denoising control module 602. The denoising control module 602 is configured to receive the image characteristic 610 in the metadata 522 of the visual data 520 and control a denoising strength or level 612 based on the image characteristic of each image frame in the visual data 520. The image characteristic 610 is associated with a brightness level of each image frame, and for example, includes an ISO value that is inversely proportional to the brightness level. The denoising level 612 is generated by the denoising control module 602 and provided to the model-based processing module 606. The denoising level 612 is within a denoising level range that is defined between 0 and 1. In some embodiments, the denoising level 612 is generated as a function of the image characteristic 610 (e.g., the ISO value) based on linear interpolation. For example, a set of data points (xi, yi) are set, where xi is the ISO value and yi is the corresponding denoising level 612. The ISO value is sampled every 5000, and selected from a set of predefined ISO values [0, 5000, 10000, . . . ]. Given the ISO value x, the denoising control module 602 identifies the two nearest predefined ISO values x0 and x1 and two corresponding predefined denoising levels y0 and y1, and the ISO value x is between the two nearest predefined ISO values, i.e., x0 ≤ x ≤ x1. The denoising level 612 (y) corresponding to the ISO value x is linearly interpolated between the two predefined denoising levels y0 and y1 corresponding to the two nearest predefined ISO values x0 and x1 as follows:

y = y0 + (y1 - y0) · (x - x0) / (x1 - x0) (1)

Further, in some embodiments, the ISO value is normalized between 0 and 1. For example, the ISO value corresponds to an ISO range of 0-20000, which is normalized to 0-1. If the ISO value of an image in the visual data 520 is 10000, the ISO value is normalized to 0.5.
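The interpolation and normalization described in paragraph [0047] might be sketched as follows; the 5000-step ISO grid and the 0-20000 normalization range come from the text above, while the specific denoising levels assigned to each grid point are illustrative assumptions.

```python
# Hypothetical denoising-control sketch. ISO_GRID follows the 5000-step
# sampling described in the text; the LEVELS values are assumed, not taken
# from the application.
ISO_GRID = [0, 5000, 10000, 15000, 20000]   # predefined ISO values x_i
LEVELS   = [0.0, 0.2, 0.5, 0.8, 1.0]        # corresponding denoising levels y_i

def denoising_level(iso):
    """Linearly interpolate a denoising level in [0, 1] from an ISO value,
    per y = y0 + (y1 - y0) * (x - x0) / (x1 - x0)."""
    iso = max(ISO_GRID[0], min(iso, ISO_GRID[-1]))  # clamp to the grid range
    for x0, x1, y0, y1 in zip(ISO_GRID, ISO_GRID[1:], LEVELS, LEVELS[1:]):
        if x0 <= iso <= x1:
            return y0 + (y1 - y0) * (iso - x0) / (x1 - x0)
    return LEVELS[-1]

def normalize_iso(iso, iso_max=20000.0):
    """Normalize an ISO value to [0, 1], e.g., 10000 -> 0.5 for a 0-20000 range."""
    return max(0.0, min(iso / iso_max, 1.0))
```

For instance, an ISO value of 10000 normalizes to 0.5, and an ISO value midway between two grid points yields a level midway between their assigned denoising levels.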

[0048] The low light processing module 518 further includes one or more of a data pre-processing module 604, a model-based processing module 606, and a data post-processing module 608. In some embodiments, the data pre-processing module 604 separates the Y channel from the U and V channels of the visual data 520, and reshapes the Y channel from 1 channel into 16 channels, e.g., for image processing using NEON (a hardware module in an Android phone) which enables single instruction multiple data (SIMD) operations. Visual data 614 including the 16 channels is provided to the model-based processing module 606. The model-based processing module 606 processes the visual data 614 to generate processed visual data 616 using a sequence of image processing networks. In an example, the sequence of image processing networks is implemented by a neural processing unit (NPU). The data post-processing module 608 is coupled to the model-based processing module 606, and configured to receive the processed visual data 616 and process the visual data 616 to the visual data 524 outputted by the low light processing module 518. In some embodiments, the data post-processing module 608 reshapes the visual data 616 from 16 channels to 1 channel and returns the resulting visual data 524 to a corresponding ISP pipeline, e.g., to a media store 512 in the camera application 502.

[0049] Figure 7 is a block diagram of another example low light processing module 518 that processes an input image 702 using a sequence of image processing networks 704, in accordance with some embodiments. The input image 702 is obtained with image metadata 522 including an image characteristic 610 related to a brightness level of the input image 702. In this example, the image characteristic 610 includes an ISO value 706 that is optionally normalized.
The low light processing module 518 includes a sequence of image processing networks 704, e.g., including networks 704A, 704B, and 704C, and generates a noise map 708 from the input image 702. The noise map 708 is combined with the input image 702 using a denoising level 612 to generate an output image 710. In some embodiments, only one of a plurality of image channels of the input image 702 (e.g., the Y channel) is processed by the low light processing module 518. In some embodiments, the output image 710 is clipped to a predefined range of image values (e.g., 0-255).
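The combination step above, in which the noise map 708 is scaled by the denoising level 612, added to the input image 702, and clipped to the valid range, might be sketched as follows (the flat pixel lists and the function name are assumptions for illustration):

```python
def apply_noise_map(input_pixels, noise_map, denoising_level):
    """Scale the predicted noise map by the denoising level, add it to the
    input pixels, and clip the result to the predefined 0-255 range.
    Flat lists stand in for image tensors in this sketch."""
    return [
        min(255, max(0, round(p + denoising_level * n)))
        for p, n in zip(input_pixels, noise_map)
    ]
```

A denoising level of 0 leaves the input unchanged, while a level of 1 applies the full predicted correction.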

[0050] The image characteristic 610 (e.g., the ISO value 706) is applied in at least two aspects in the low light processing module 518. In one aspect, the ISO value 706 is applied with the sequence of image processing networks 704 to generate the noise map 708. The sequence of image processing networks 704 generate one or more intermediate feature maps 712, and at least one of the one or more intermediate feature maps 712 is modified based on the ISO value 706 before the at least one of the one or more intermediate feature maps 712 is processed by a next image processing network 704. In some situations, a neural network 705 is applied to modify the at least one intermediate feature map 712 based on the ISO value 706. In another aspect, the ISO value 706 is used to generate a denoising level 612 in a denoising control module 602, and the denoising level 612 is applied to adjust the noise map 708 that is further combined with the input image 702 to generate the output image 710. In an example, the noise map 708 generated by the sequence of image processing networks 704 is multiplied by the denoising level 612 prior to being added to the input image 702.

[0051] In some embodiments, the sequence of image processing networks 704 includes three networks 704A, 704B, and 704C arranged according to a network order. These three networks 704A-704C are separated by two feature maps 712 including a first feature map 712A generated by a first network 704A and a second feature map 712B generated by a second network 704B. Each of the intermediate feature maps 712A and 712B is distinct from the input image 702 and the output image 710 of the sequence of image processing networks 704. In an example shown in Figure 7, the first feature map 712A generated by the first network 704A is modified based on the ISO value 706. The modified first feature map 712A' is provided to the second network 704B in place of the first feature map 712A, and the second network 704B converts the modified first feature map 712A' to the second feature map 712B. In another example, the second feature map 712B is modified based on the ISO value 706, and the modified second feature map is provided to the third network 704C in place of the second feature map 712B. In some embodiments, both of the intermediate feature maps 712A and 712B are modified based on the ISO value 706, and the two modified intermediate feature maps are provided to the second network 704B and third network 704C, respectively.

[0052] Referring to Figure 7, in some embodiments, the ISO value 706 is normalized and used to scale each element of the first feature map 712A to generate a scaled feature map 714. The first feature map 712A and the scaled feature map 714 are combined to generate a combined feature map, and the combined feature map is provided to and processed by a neural network 705 (e.g., a CNN 705), thereby generating the modified first feature map 712A’. In some embodiments, the first feature map 712A and the scaled feature map 714 are combined by concatenating the first feature map 712A and the scaled feature map 714 in a channel dimension to generate the combined feature map. The modified first feature map 712A’ is further provided to the other networks 704B and 704C to generate the noise map 708. The second network 704B includes an encoder-decoder network (e.g., a U-net), and the third network 704C includes a third CNN that optionally functions as a pooling network to convert the second feature map 712B to the noise map 708.
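A hypothetical sketch of the feature modification just described: every element of the first feature map 712A is scaled by the normalized ISO value, and the scaled copy is concatenated with the original map along the channel dimension. Nested Python lists stand in for a C × H × W tensor here; in practice this would be a framework tensor operation, and the combined map would then be processed by the neural network 705.

```python
def modify_feature_map(feature_map, iso_norm):
    """Scale every element of a C x H x W feature map by the normalized ISO
    value, then concatenate the scaled copy with the original map in the
    channel dimension (yielding 2C x H x W). The combined map would then be
    fed to a small CNN to produce the modified feature map; the helper name
    and list representation are illustrative assumptions."""
    scaled = [[[v * iso_norm for v in row] for row in channel]
              for channel in feature_map]
    return feature_map + scaled  # channel-wise concatenation
```

Note that the original feature map is preserved unchanged in the first C channels, so the downstream network can weigh the raw and ISO-scaled features independently.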

[0053] The low light processing module 518 is implemented at an electronic device and based on a data processing model 240 that includes the sequence of image processing networks 704 and the neural network 705. In some embodiments, the data processing model 240 is trained at a server 102 and provided to the electronic device. The server 102 trains the data processing model 240 end-to-end based on a loss function 716. During training, the input image 702 is a test image provided with a ground truth image 718. After the output image 710 is generated from the input image 702 using the data processing model 240 and based on the ISO value 706, the loss function 716 is determined based on the output image 710 and the ground truth image 718. In an example, the loss function 716 is a weighted combination of a pixel-wise L1-norm loss, an edge loss, and a content loss. The pixel-wise L1-norm loss is determined between the output image 710 and the ground truth image 718. The edge loss is determined between a first edge image and a second edge image. The first and second edge images include edge information of the output and ground truth images 710 and 718, respectively. The content loss is determined between a first semantic map extracted from the output image 710 and a second semantic map extracted from the ground truth image 718. The sequence of image processing networks 704 and neural network 705 are trained jointly using the loss function 716 that combines the pixel-wise L1-norm loss, edge loss, and content loss.

[0054] Specifically, in some embodiments, the pixel-wise L1-norm loss includes a least absolute deviations (LAD) or a least absolute errors (LAE) loss, and corresponds to a sum of absolute differences D between a target value yi and the estimated values f(xi):

D = Σi |yi - f(xi)| (2)

A structural similarity index (SSIM) is applied to measure a perceptual similarity between two images. The SSIM includes three independent comparison measurements: luminance, contrast, and structure. For a multi-scale SSIM, different scales of images are extracted from an original image by applying low-pass filtering and downsampling operations, and the SSIM is applied on the different scales of images. In some embodiments, a Sobel edge filter is applied to extract edges in the output image 710 and ground truth image 718. An L1-norm loss is calculated for the extracted edges of the output image 710 and ground truth image 718 to represent the edge loss. During training, the edge loss is reduced to reduce an edge difference and keep edge details in the output image 710. In some embodiments, the content loss is applied to improve a perceptual quality of the output image 710. Layers of a pretrained model are applied to extract semantic feature maps from both the output image 710 and ground truth image 718. An L1 distance is determined between the semantic feature maps to represent the content loss. In an example, the pretrained model includes the fifth layer and the ninth layer of a pretrained VGG19 network, a CNN that is 19 layers deep.
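A minimal sketch of the weighted loss combination described above; the three component losses are reduced to plain L1 sums over pre-extracted pixel, edge, and semantic values, and the weights are illustrative assumptions, not values from the application.

```python
def l1_loss(a, b):
    """Pixel-wise L1 (least absolute deviations) loss: sum of |y_i - f(x_i)|,
    matching equation (2) above."""
    return sum(abs(x - y) for x, y in zip(a, b))

def total_loss(output, truth, out_edges, truth_edges, out_sem, truth_sem,
               w_pixel=1.0, w_edge=0.5, w_content=0.1):
    """Weighted combination of the pixel-wise L1 loss, the edge loss (on
    Sobel-extracted edges), and the content loss (on semantic feature maps).
    The weight values here are assumptions for illustration."""
    return (w_pixel * l1_loss(output, truth)
            + w_edge * l1_loss(out_edges, truth_edges)
            + w_content * l1_loss(out_sem, truth_sem))
```

In practice the edge and semantic inputs would be produced by a Sobel filter and a pretrained feature extractor, respectively; here they are passed in pre-computed to keep the sketch self-contained.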

[0055] Quantization is applied to perform computations and store tensors at lower bitwidths than a floating point precision. A quantized model executes some or all of the operations on tensors with integers rather than floating point values. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms. In some embodiments, the data processing model 240 is quantized according to a precision setting of an electronic device where the data processing model 240 will be loaded. For example, the electronic device is a mobile device having limited computational resources and a lower precision than a floating point data format. Weights of the data processing model 240 are quantized based on the lower precision. The quantized data processing model 240 can result in a significant accuracy drop and make image processing a lossy process. In some embodiments, the data processing model 240 is re-trained with the quantized weights to minimize the loss function 716. Such quantization-aware training simulates low precision behavior in a forward pass, while a backward pass remains the same, which induces a quantization error that is accumulated in the loss function, and an optimizer module is applied to reduce the quantization error.

[0056] In some embodiments, weights associated with filters of the sequence of image processing networks 704 and neural network 705 maintain a float32 format, and are quantized based on a precision setting of an electronic device. For example, the weights are quantized from the float32 format to an int8, uint8, int16, or uint16 format based on the precision setting of the client device 104. Specifically, in an example, the client device 104 uses a CPU to run the sequence of image processing networks 704 and neural network 705, and the CPU of the client device 104 processes 32-bit data. The weights of the image processing networks 704 and neural network 705 are not quantized, and the image processing networks 704 and neural network 705 are provided to the client device 104 directly. In another example, the client device 104 uses one or more GPUs to run the image processing networks 704 and neural network 705, and the GPU(s) process 16-bit data. The weights of the image processing networks 704 and neural network 705 are quantized to an int16 format. In yet another example, the client device 104 uses a DSP to run the image processing networks 704 and neural network 705, and the DSP processes 8-bit data. The weights of the image processing networks 704 and neural network 705 are quantized to an int8 format. After quantization of the weights, e.g., to a fixed 8-bit format, the image processing networks 704 have fewer MACs and a smaller size, and are hardware-friendly during deployment on the client device 104.
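A hypothetical sketch of the per-device weight quantization described above, using a simple symmetric linear scheme to map float weights onto a signed integer range (the helper name, the symmetric scheme, and the absence of clamping are assumptions for illustration; the application does not specify the quantization algorithm):

```python
def quantize_weights(weights, bits):
    """Symmetric linear quantization of float weights to signed `bits`-bit
    integers. bits=32 models the CPU path where weights stay float32;
    bits=16 and bits=8 model the GPU (int16) and DSP (int8) paths."""
    if bits == 32:                        # CPU path: no quantization
        return list(weights), 1.0
    qmax = 2 ** (bits - 1) - 1            # e.g., 127 for int8, 32767 for int16
    scale = max(abs(w) for w in weights) / qmax or 1.0  # guard all-zero weights
    return [round(w / scale) for w in weights], scale
```

Dequantization multiplies each integer back by the returned scale, so the reconstruction error per weight is bounded by half the scale step, which is the accuracy loss that quantization-aware retraining compensates for.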

[0057] Figure 8A is a block diagram of an example encoder-decoder network 800 (e.g., a U-net) in which at least one intermediate feature map is modified based on an image characteristic 610 of an input image 802, in accordance with some embodiments. An electronic device employs the encoder-decoder network 800 (e.g., with or without the networks 704A and 704C) to perform image denoising and enhancement operations, thereby improving perceptual quality of an image taken under a low illumination condition. For example, the U-net is employed to predict a noise map 804 (ΔI) of the input image 802 based on an equation, ΔI = f(I; w), where w is a set of learnable parameters of the U-net. A denoised output image is a sum of the input image 802 and the predicted noise map 804, and has better image quality (e.g., a higher SNR) than the input image 802. In the U-net, the input image is processed successively by a set of downsampling stages (i.e., encoding stages) 806 to extract a series of feature maps, as well as to reduce spatial resolutions of these feature maps successively. An encoded feature map outputted by the encoding stages 806 is then processed by a bottleneck network 808 followed by a set of upscaling stages (i.e., decoding stages) 810. The series of decoding stages 810 include the same number of stages as the series of encoding stages 806. In some embodiments, in each decoding stage 810, an input feature map 816 is upscaled and concatenated with the feature map of the same resolution from the encoding stage 806 to effectively preserve the details in the input image 802.

[0058] In an example, the encoder-decoder network 800 has four encoding stages 806A-806D and four decoding stages 810A-810D. The bottleneck network 808 is coupled between the encoding stages 806 and decoding stages 810. The input image 802 is successively processed by the series of encoding stages 806A-806D, the bottleneck network 808, and the series of decoding stages 810A-810D to generate the noise map 804. In some embodiments, an original image is divided into a plurality of image tiles, and the input image 802 corresponds to one of the plurality of image tiles. Stated another way, in some embodiments, each of the plurality of image tiles is processed using the encoder-decoder network 800, and all of the image tiles in the original image are successively processed using the encoder-decoder network 800. Each output tile is collected and combined with one another to reconstruct a final noise map corresponding to the original image.
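The tile-by-tile processing described above can be sketched as follows, assuming non-overlapping tiles and an image whose dimensions divide evenly by the tile size; the function name and the toy model are hypothetical.

```python
import numpy as np

def process_in_tiles(image, tile, model):
    """Split an image into non-overlapping tiles, run the model on each
    tile, and stitch the per-tile outputs back into a full-size map."""
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = image[y:y + tile, x:x + tile]
            out[y:y + tile, x:x + tile] = model(patch)
    return out

# Toy model halves each tile; a real deployment would call the
# encoder-decoder network 800 on each tile instead.
noise_map = process_in_tiles(np.ones((8, 8), np.float32), 4, lambda p: p * 0.5)
```

Tiling bounds peak memory usage, since only one tile's activations are resident at a time, at the cost of running the network once per tile.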

[0059] The series of encoding stages 806 include an ordered sequence of encoding stages 806, e.g., stages 806A, 806B, 806C, and 806D, and have an encoding scale factor. Each encoding stage 806 generates an encoded feature map 812 having a feature resolution and a number of encoding channels. Among the encoding stages 806A-806D, the feature resolution is scaled down and the number of encoding channels is scaled up according to the encoding scale factor. In an example, the encoding scale factor is 2. A first encoded feature map 812A of a first encoding stage 806A has a first feature resolution (e.g., H×W) related to the image resolution and a first number of (e.g., NCH) encoding channels, and a second encoded feature map 812B of a second encoding stage 806B has a second feature resolution (e.g., ½H×½W) and a second number of (e.g., 2NCH) encoding channels. A third encoded feature map 812C of a third encoding stage 806C has a third feature resolution (e.g., ¼H×¼W) and a third number of (e.g., 4NCH) encoding channels, and a fourth encoded feature map 812D of a fourth encoding stage 806D has a fourth feature resolution (e.g., ⅛H×⅛W) and a fourth number of (e.g., 8NCH) encoding channels.

[0060] For each encoding stage 806, the encoded feature map 812 is processed and provided as an input to a next encoding stage 806, except that the encoded feature map 812 of a last encoding stage 806 (e.g., stage 806D in Figure 8A) is processed and provided as an input to the bottleneck network 808. Additionally, for each encoding stage 806, a pooled feature map 814 is converted from the encoded feature map 812, e.g., using a max pooling layer. The pooled feature map 814 is temporarily stored in memory and extracted for further processing by a corresponding decoding stage 810. Stated another way, the pooled feature maps 814A-814D are stored in the memory as skip connections that skip part of the encoder-decoder network 800.
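With an encoding scale factor of 2, the shape progression across the four stages can be written out directly; the input resolution and base channel count below are hypothetical example values.

```python
# Feature-map shapes across the four encoding stages with scale factor 2:
# the resolution halves while the channel count doubles at every stage.
H, W, NCH = 256, 256, 32  # hypothetical input resolution and base channels
shapes = [(H >> i, W >> i, NCH << i) for i in range(4)]
# stage 806A: (256, 256, 32), stage 806B: (128, 128, 64),
# stage 806C: (64, 64, 128),  stage 806D: (32, 32, 256)
```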

[0061] The bottleneck network 808 is coupled to the last stage of the encoding stages 806 (e.g., stage 806D in Figure 8A), and continues to process the total number of encoding channels of the encoded feature map 812D of the last encoding stage 806D and generate a bottleneck feature map 816A (i.e., a first input feature map 816A to be used by a first decoding stage 810A). In an example, the bottleneck network 808 includes a first set of 3×3 CNN and Rectified Linear Unit (ReLU), a second set of 3×3 CNN and ReLU, a global pooling network, a bilinear upsampling network, and a set of 1×1 CNN and ReLU. The encoded feature map 812D of the last encoding stage 806D is normalized (e.g., using a pooling layer), and fed to the first set of 3×3 CNN and ReLU of the bottleneck network 808. The bottleneck feature map 816A is outputted by the set of 1×1 CNN and ReLU of the bottleneck network 808 and provided to the decoding stages 810.
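The global-pooling and upsampling branch of the bottleneck can be sketched as below. The convolution and ReLU layers are omitted, and broadcasting stands in for bilinear upsampling (the two coincide when upsampling from a 1×1 map); the function name is hypothetical.

```python
import numpy as np

def global_pool_and_upsample(feature_map):
    """Global average pooling followed by upsampling back to the input
    resolution, so the pooled context can be fused with local features.

    feature_map: array of shape (H, W, C).
    """
    pooled = feature_map.mean(axis=(0, 1), keepdims=True)     # 1 x 1 x C
    return np.broadcast_to(pooled, feature_map.shape).copy()  # H x W x C

f = np.arange(2 * 2 * 3, dtype=np.float32).reshape(2, 2, 3)
g = global_pool_and_upsample(f)
```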

[0062] The series of decoding stages 810 include an ordered sequence of decoding stages 810, e.g., stages 810A, 810B, 810C, and 810D, and have a decoding upsampling factor. Each decoding stage 810 generates a decoded feature map 818 having a feature resolution and a number of decoding channels. Among the decoding stages 810A-810D, the feature resolution is scaled up and the number of decoding channels is scaled down according to the decoding upsampling factor. In an example, the decoding upsampling factor is 2. A first decoded feature map 818A of a first decoding stage 810A has a first feature resolution (e.g., ⅛H′×⅛W′) and a first number of (e.g., 8NCH′) decoding channels, and a second decoded feature map 818B of a second decoding stage 810B has a second feature resolution (e.g., ¼H′×¼W′) and a second number of (e.g., 4NCH′) decoding channels. A third decoded feature map 818C of a third decoding stage 810C has a third feature resolution (e.g., ½H′×½W′) and a third number of (e.g., 2NCH′) decoding channels, and a fourth decoded feature map 818D of a fourth decoding stage 810D has a fourth feature resolution (e.g., H′×W′) related to a resolution of the noise map 804 and a fourth number of (e.g., NCH′) decoding channels.

[0063] For each decoding stage 810, the decoded feature map 818 is processed and provided as an input to a next decoding stage 810, except that the decoded feature map 818 of a last decoding stage 810 (e.g., stage 810D in Figure 8A) is processed to generate the noise map 804. For example, the decoded feature map 818D of the last decoding stage 810D is processed by a 1×1 CNN 822 to generate the noise map 804, which is further combined with the input image 802 to reduce a noise level of the input image 802 via the entire encoder-decoder network 800. Additionally, each respective decoding stage 810 combines the pooled feature map 814 with an input feature map 816 of the respective decoding stage 810 using a set of neural networks 824. Each respective decoding stage 810 and the corresponding encoding stage 806 are symmetric with respect to the bottleneck network 808, i.e., separated from the bottleneck network 808 by the same number of decoding or encoding stages 810 or 806. In some embodiments, the pooled feature map 814 is purged from the memory after it is used by the respective decoding stage 810.

[0064] In various embodiments of this application, the input image 802 is obtained with associated metadata 522, and the metadata 522 includes an image characteristic 610 associated with a brightness level of the input image. An example of the image characteristic 610 is an ISO value 706. The image characteristic 610 of the input image 802 is applied in at least two aspects. In one aspect, the image characteristic 610 is applied with the sequence of encoding and decoding stages 806 and 810 and bottleneck network 808 of the encoder-decoder network 800 to generate the noise map 804. The encoding and decoding stages 806 and 810 and bottleneck network 808 generate a plurality of intermediate feature maps (e.g., encoded feature maps 812, pooled feature maps 814, decoded feature maps 818), and at least one of the plurality of intermediate feature maps is modified based on the image characteristic 610 before the at least one feature map is processed by a next image processing network in the encoder-decoder network 800. In another aspect, the image characteristic 610 is used to generate a denoising level 612 in a denoising control module 602, and the denoising level 612 is applied to adjust the noise map 804 generated by the encoder-decoder network 800. The adjusted noise map is further combined with the input image 802 to generate the output image 710. Specifically, in an example, the noise map 804 is multiplied by the denoising level 612 prior to being added to the input image 802 to generate the output image 710, and the output image 710 has better image quality (e.g., associated with a higher SNR) than the input image 802.
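The second aspect, scaling the noise map by the denoising level before the final summation, can be sketched in a few lines; the function name and the example values are hypothetical.

```python
import numpy as np

def enhance(image, noise_map, denoising_level):
    """Scale the predicted noise map by the denoising level derived from
    the image characteristic, then add it to the input image to form
    the output image."""
    return image + denoising_level * noise_map

# With a lower denoising level, less of the predicted correction is
# applied, so the same noise map yields a different output quality.
out = enhance(np.full((2, 2), 0.8, np.float32),
              np.full((2, 2), -0.2, np.float32),
              denoising_level=0.5)
```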

[0065] Figure 8B is a flow diagram of an example process 850 of modifying an intermediate feature map based on an image characteristic 610 of an input image 802, in accordance with some embodiments. As explained above, an example of the image characteristic 610 of the input image 802 is an ISO value 706. The ISO value 706 is applied with the sequence of encoding and decoding stages 806 and 810 and bottleneck network 808 of the encoder-decoder network 800 to generate the noise map 804. The encoding and decoding stages 806 and 810 and bottleneck network 808 generate a plurality of intermediate feature maps (e.g., encoded feature maps 812, pooled feature maps 814, decoded feature maps 818), and at least one of the plurality of intermediate feature maps is modified based on the ISO value 706 before the at least one intermediate feature map is processed by a next image processing network in the encoder-decoder network 800.

[0066] The ISO value 706 is normalized and used to scale each element of an intermediate feature map 852 of the encoder-decoder network 800 to generate a scaled feature map 854. The intermediate feature map 852 and the scaled feature map 854 are combined to generate a combined feature map 856, and the combined feature map 856 is provided to and processed by a next image processing network in the encoder-decoder network 800. In some embodiments, the intermediate feature map 852 and the scaled feature map 854 are combined by concatenating the intermediate feature map 852 and the scaled feature map 854 in a channel dimension to generate the combined feature map 856. Alternatively, in some embodiments, the intermediate feature map 852 and the scaled feature map 854 are combined by concatenating the intermediate feature map 852 and the scaled feature map 854 and processing the concatenated feature map using a network (e.g., a CNN) to generate the combined feature map 856.
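The normalize-scale-concatenate sequence above can be sketched as below. The normalization bound `ISO_MAX` is an assumed constant for illustration, and the simple channel-wise concatenation is shown; the optional follow-on CNN is omitted.

```python
import numpy as np

ISO_MAX = 6400.0  # assumed upper bound used to normalize the ISO value

def modify_feature_map(feature_map, iso_value):
    """Normalize the ISO value, scale every element of the intermediate
    feature map by it, and concatenate the original and scaled maps in
    the channel dimension to form the combined feature map."""
    iso_norm = iso_value / ISO_MAX
    scaled = feature_map * iso_norm
    return np.concatenate([feature_map, scaled], axis=-1)

f = np.ones((4, 4, 8), np.float32)       # an intermediate feature map
combined = modify_feature_map(f, 3200.0)  # channel count: 8 -> 16
```

Because concatenation doubles the channel count, the next network in the sequence must be built to accept the widened input.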

[0067] It is noted that when an image characteristic 610 associated with a brightness level of an image is applied to modify intermediate feature map(s) generated by image processing networks, image quality can be improved efficiently, i.e., with limited memory usage and power consumption. The sequence of image processing networks receives three inputs including a noisy input image 802, the image characteristic 610, and a denoising level 612. In an example, both the image characteristic 610 and the denoising level 612 are associated with an ISO value 706. If the image characteristic 610 stored in metadata of the image is varied, an output image offers a different image quality level.

[0068] Figure 9 is a flow diagram of an example image processing method 900, in accordance with some embodiments. For convenience, the method 900 is described as being implemented by an electronic device (e.g., a mobile phone 104C, AR glasses 104D, or a smart television device). Method 900 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the electronic device. Each of the operations shown in Figure 9 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 900 may be combined and/or the order of some operations may be changed.
The image processing method 900 is applied to reduce a noise level of an input image and enhance its image quality, e.g., under insufficient lighting conditions or when a camera 260 is moving fast and a three-dimensional noise reduction (3DNR) method is disabled.

[0069] Specifically, the electronic device obtains (902) an input image (e.g., in visual data 520) and associated image metadata 522, and the associated image metadata 522 includes an image characteristic 610 related to a brightness level of the input image. The electronic device generates (904) a noise map (e.g., noise maps 708 and 804) from the input image using a sequence of image processing networks (e.g., networks 704). Specifically, one or more intermediate feature maps are generated (906) from the sequence of image processing networks, and the one or more intermediate feature maps are distinct from the noise map and include a first feature map (e.g., feature maps 712A and 712B). The first feature map is modified (908) based on the image characteristic 610. The electronic device generates (910) an output image (e.g., image 710 in Figure 7) from the input image and the noise map. In some embodiments, the electronic device determines (912) a denoising level 612 from the image characteristic 610 and adjusts (914) the noise map using the denoising level 612. The adjusted noise map is used to generate the output image. In some embodiments, the image characteristic 610 includes an ISO value 706. In some embodiments, the image characteristic 610 includes a lux index, which is the ISO value 706 multiplied by an exposure time.

[0070] In some embodiments, the electronic device modifies the first feature map by normalizing (916) the image characteristic 610, scaling (918) each element of the first feature map using the normalized image characteristic 610 to generate a scaled feature map (e.g., map 714), and combining (920) the first feature map and the scaled feature map to generate a combined feature map (e.g., feature map 712A’). The combined feature map is provided to and processed by a next image processing network.
Further, in some embodiments, the electronic device concatenates the first feature map and the scaled feature map in a channel dimension, and processes the concatenated feature map using a CNN 705 to generate the combined feature map.

[0071] In some embodiments, the sequence of image processing networks 704 includes a first network 704A, a second network 704B that follows the first network 704A, and a third network 704C that follows the second network 704B. The one or more intermediate feature maps 712 further include a second feature map 712B, and the first and second feature maps 712A and 712B are generated from the first and second networks 704A and 704B, respectively. The electronic device further modifies (922) the second feature map based on the image characteristic 610 and processes (924) the modified second feature map with the third network 704C.

[0072] In some embodiments, the sequence of image processing networks 704 includes a first network 704A and a second network 704B that follows the first network 704A, the first feature map 712A is generated from the first network 704A, and the one or more intermediate feature maps 712 include a second feature map 712B. The second network 704B receives the modified first feature map 712A’ and generates the second feature map 712B from the modified first feature map 712A’. Further, in some embodiments, the first network 704A includes a first convolutional neural network that receives the input image and generates the first feature map 712A. The second network includes at least one of a convolutional neural network 704C and an encoder-decoder network 704B. Further, in some embodiments, the second network includes the encoder-decoder network 704B and the convolutional neural network 704C, and the noise map 708 is generated from the second network.

[0073] In some embodiments, the sequence of image processing networks includes an encoder-decoder network 800, and the encoder-decoder network further includes a series of encoding stages, a series of decoding stages, and a bottleneck network coupled between the series of encoding stages and the series of decoding stages, and the first feature map is generated by one of the encoding stages, decoding stages, and bottleneck network.

[0074] In some embodiments, the electronic device determines a pixel-wise L1-norm loss based on the output image and a ground truth image of the input image, and determines an edge loss between a first edge image and a second edge image. The first and second edge images include edge information of the output and ground truth images, respectively. The electronic device determines a content loss between a first semantic map extracted from the output image and a second semantic map extracted from the ground truth image, determines a comprehensive loss combining the pixel-wise L1-norm loss, edge loss, and content loss in a weighted manner, and trains the sequence of image processing networks using the comprehensive loss.
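The weighted combination of the three loss terms can be sketched as follows. The weights, the gradient-based edge extractor, and the placeholder content term are assumptions for illustration; a real training pipeline would use the edge detector and pretrained semantic feature extractor chosen by the implementer.

```python
import numpy as np

def comprehensive_loss(out, gt, w_pix=1.0, w_edge=0.1, w_content=0.1):
    """Weighted combination of a pixel-wise L1-norm loss, an edge loss
    between edge images, and a content loss between semantic maps.

    np.gradient stands in for an edge detector, and a pixel-space MSE
    stands in for a feature-space content loss; both are placeholders.
    """
    l1 = np.mean(np.abs(out - gt))                      # pixel-wise L1
    edge = np.mean(np.abs(np.gradient(out)[0]
                          - np.gradient(gt)[0]))        # edge loss
    content = np.mean((out - gt) ** 2)                  # content placeholder
    return w_pix * l1 + w_edge * edge + w_content * content

# Identical output and ground truth images give a loss of zero.
loss = comprehensive_loss(np.ones((4, 4)), np.ones((4, 4)))
```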

[0075] In some embodiments, each of the sequence of image processing networks includes a plurality of weights associated with a respective plurality of filters of each layer. The electronic device quantizes the plurality of weights based on a data format by maintaining the data format for the plurality of weights while training the sequence of image processing networks using a predefined loss function. Further, in some embodiments, the data format of the plurality of weights is selected based on a precision setting of the electronic device, and the sequence of image processing networks including the quantized weights is provided to the electronic device. Additionally, in some embodiments, the data format is selected from float32, int8, uint8, int16, and uint16.

[0076] It should be understood that the particular order in which the operations in Figure 9 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to enhance image quality as described herein. Additionally, it should be noted that details of other processes described above with respect to Figures 5-8B are also applicable in an analogous manner to method 900 described above with respect to Figure 9. For brevity, these details are not repeated here.

[0077] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.

[0078] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.

[0079] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.

[0080] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.