

Title:
NEURAL NETWORK TRAINING METHOD AND APPARATUS, IMAGE PROCESSING METHOD AND APPARATUS
Document Type and Number:
WIPO Patent Application WO/2024/072250
Kind Code:
A1
Abstract:
Embodiments of this application provide a neural network training method and a neural network training apparatus, including: obtaining multiple training image pairs, where each pair of the multiple training image pairs includes a high-quality image and a low-quality image, where the high-quality image and the low-quality image are images with different qualities depicting the same scenery; inputting the low-quality image into an enhancement network to generate an enhanced image; inputting the high-quality image and at least two intermediate tensors into a degradation network to generate a degraded image, where the at least two intermediate tensors are determined based on the enhancement network; determining a first loss function based on the high-quality image and the enhanced image; determining a second loss function based on the low-quality image and the degraded image; and updating the neural network based on the first loss function and the second loss function, where the neural network includes the enhancement network and the degradation network. According to the technical solution, the neural network training method can improve training efficiency and improve image enhancement quality.

Inventors:
ANISIMOVSKY VALERY VALERIEVICH (RU)
GNATYUK VITALY SERGEEVICH (RU)
VOSTOKOV SERGEY VITALYEVICH (RU)
CHERNYAEVA SOFYA ALEXANDROVNA (RU)
STROEV VYCHESLAV IGOREVICH (RU)
DUBYSHKIN IGNATII ARKADIEVICH (RU)
TELEGINA ANNA DMITRIEVNA (RU)
Application Number:
PCT/RU2022/000292
Publication Date:
April 04, 2024
Filing Date:
September 27, 2022
Assignee:
HUAWEI TECH CO LTD (CN)
ANISIMOVSKY VALERY VALERIEVICH (RU)
International Classes:
G06T3/40; G06N3/0455; G06T5/00
Foreign References:
US20200387750A12020-12-10
Other References:
DASONG LI ET AL: "Learning Degradation Representations for Image Deblurring", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 10 August 2022 (2022-08-10), XP091291869
SHCHERBININ ANDREY ET AL: "Image Enhancement using Entwined Auto-Encoder CNN with Nested Convolution Kernels", THE 34TH CANADIAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, 25 May 2021 (2021-05-25), pages 1 - 12, XP055915302, Retrieved from the Internet [retrieved on 20220425]
LYU KEJIE ET AL: "JSENet: A deep convolutional neural network for joint image super-resolution and enhancement", NEUROCOMPUTING, ELSEVIER, AMSTERDAM, NL, vol. 489, 1 January 2022 (2022-01-01), pages 570 - 583, XP087099310, ISSN: 0925-2312, [retrieved on 20220101], DOI: 10.1016/J.NEUCOM.2021.12.071
Attorney, Agent or Firm:
SOJUZPATENT (RU)
Claims:
CLAIMS

What is claimed is:

1. A neural network training method, comprising: obtaining multiple training image pairs, wherein each pair of the multiple training image pairs comprises a high-quality image and a low-quality image, wherein the high-quality image and the low-quality image are images with different qualities depicting the same scenery; inputting the low-quality image into an enhancement network to generate an enhanced image; and inputting the high-quality image and at least two intermediate tensors into a degradation network to generate a degraded image, wherein the at least two intermediate tensors are determined based on the enhancement network; determining a first loss function based on the high-quality image and the enhanced image; and determining a second loss function based on the low-quality image and the degraded image; and updating the neural network based on the first loss function and the second loss function, wherein the neural network comprises the enhancement network and the degradation network.

2. The method according to claim 1, wherein the enhancement network comprises a first encoder and a first decoder; the at least two intermediate tensors comprise a first intermediate tensor and a second intermediate tensor, wherein the first intermediate tensor is determined based on a bottleneck tensor in the first encoder and the second intermediate tensor is determined based on at least one of network layers in the first decoder.

3. The method according to claim 2, wherein the degradation network comprises a first subnetwork, the inputting the high-quality image and at least two intermediate tensors into a degradation network to generate a degraded image comprises: inputting the first intermediate tensor and the second intermediate tensor into the first subnetwork to generate a per-pixel kernel map, wherein convolution weights of multiple network layers of the first subnetwork are determined based on the first intermediate tensor; inputting the high-quality image into the degradation network; and convolving the high-quality image with per-pixel kernels to generate the degraded image, wherein the per-pixel kernels are determined from the per-pixel kernel map.

4. The method according to claim 2 or 3, wherein a height x width size of the first intermediate tensor is 1x1 or 2x2 or 3x3; or a height x width size of the first intermediate tensor is the same as a height x width size of the bottleneck tensor, while a channel number of the first intermediate tensor is 1/8 or 1/4 or 1/2 of a channel number of the bottleneck tensor.

5. The method according to any one of claims 2 to 4, wherein the method further comprises: injecting a processed result of the first intermediate tensor into at least one network layer in the first decoder, wherein the processed result of the first intermediate tensor has a spatial size of 1 x 1.

6. The method according to claim 3, wherein the first subnetwork comprises a pointwise convolution weights estimation subnetwork and a kernel map estimation subnetwork, the inputting the first intermediate tensor and the second intermediate tensor into the first subnetwork to generate a per-pixel kernel map comprises: inputting the first intermediate tensor into the pointwise convolution weights estimation subnetwork to generate multiple dynamic pointwise convolution weights; inputting the second intermediate tensor into the first subnetwork and determining a local component tensor based on the second intermediate tensor; and inputting the local component tensor and the multiple dynamic pointwise convolution weights into the kernel map estimation subnetwork to generate the per-pixel kernel map, wherein convolution weights of multiple network layers of the kernel map estimation subnetwork are the multiple dynamic pointwise convolution weights.

7. The method according to claim 6, wherein the first subnetwork further comprises a kernel embedding map estimation subnetwork, the inputting the second intermediate tensor into the first subnetwork and determining a local component tensor based on the second intermediate tensor comprises: inputting the second intermediate tensor into the kernel embedding map estimation subnetwork to generate the local component tensor.

8. The method according to claim 6, wherein the enhancement network further comprises a kernel embedding map estimation subnetwork, the local component tensor is the second intermediate tensor, the method further comprises: inputting at least one of network layer tensors in the first decoder into the kernel embedding map estimation subnetwork to generate the second intermediate tensor; and concatenating the second intermediate tensor with one of network layer tensors in the first decoder.

9. The method according to any one of claims 6 to 8, wherein a height x width size of the local component tensor is the same as a height x width size of the high-quality image and a channel number of the local component tensor is 4 or 8 or 16 or 24 or 32.

10. The method according to any one of claims 2 to 9, wherein skip connections are provided between the first encoder and the first decoder.

11. The method according to any one of claims 1 to 10, wherein the enhancement network follows a U-net design.

12. An image processing method, comprising: obtaining a to-be-processed image; processing the to-be-processed image with an enhancement network, wherein the enhancement network is obtained by training a neural network comprising the enhancement network and a degradation network based on multiple training image pairs, wherein each pair of the multiple training image pairs comprises a high-quality image and a low-quality image; the neural network is trained based on a first loss function and a second loss function, wherein the first loss function is determined based on the high-quality image and an enhanced image, the second loss function is determined based on the low-quality image and a degraded image; and the enhanced image is generated by the enhancement network with an input of the low- quality image, the degraded image is generated by the degradation network with an input of the high-quality image and at least two intermediate tensors, wherein the at least two intermediate tensors are determined based on the enhancement network.

13. The method according to claim 12, wherein the enhancement network comprises a first encoder and a first decoder; the at least two intermediate tensors comprise a first intermediate tensor and a second intermediate tensor, wherein the first intermediate tensor is determined based on a bottleneck tensor in the first encoder and the second intermediate tensor is determined based on at least one of network layers in the first decoder.

14. The method according to claim 13, wherein the degradation network comprises a first subnetwork, the degraded image is obtained by convolving the high-quality image with per- pixel kernels, the per-pixel kernels are determined from a per-pixel kernel map, which is generated by the first subnetwork with inputs of the first intermediate tensor and the second intermediate tensor, wherein convolution weights of multiple network layers of the first subnetwork are determined based on the first intermediate tensor.

15. The method according to claim 13 or 14, wherein a height x width size of the first intermediate tensor is 1 x 1 or 2x2 or 3x3; or a height x width size of the first intermediate tensor is the same as a height x width size of the bottleneck tensor, while a channel number of the first intermediate tensor is 1/8 or 1/4 or 1/2 of a channel number of the bottleneck tensor.

16. The method according to any one of claims 13 to 15, wherein a processed result of the first intermediate tensor is further injected into at least one network layer in the first decoder, wherein the processed result of the first intermediate tensor has a spatial size of 1 x 1.

17. The method according to claim 14, wherein the first subnetwork comprises a pointwise convolution weights estimation subnetwork and a kernel map estimation subnetwork, the per- pixel kernel map is generated by the kernel map estimation subnetwork with inputs of a local component tensor and multiple dynamic pointwise convolution weights, wherein convolution weights of multiple network layers of the kernel map estimation subnetwork are the multiple dynamic pointwise convolution weights; and the multiple dynamic pointwise convolution weights are generated by the pointwise convolution weights estimation subnetwork with an input of the first intermediate tensor, and the local component tensor is determined based on the second intermediate tensor.

18. The method according to claim 17, wherein the first subnetwork further comprises a kernel embedding map estimation subnetwork, the local component tensor is determined by the kernel embedding map estimation subnetwork with an input of the second intermediate tensor.

19. The method according to claim 17, wherein the enhancement network further comprises a kernel embedding map estimation subnetwork, the local component tensor is the second intermediate tensor, the second intermediate tensor is generated by the kernel embedding map estimation subnetwork with an input of at least one of network layer tensors in the first decoder; the second intermediate tensor is concatenated with one of network layer tensors in the first decoder.

20. The method according to any one of claims 17 to 19, wherein a height x width size of the local component tensor is the same as a height x width size of the high-quality image and a channel number of the local component tensor is 4 or 8 or 16 or 24 or 32.

21. The method according to any one of claims 13 to 20, wherein skip connections are provided between the first encoder and the first decoder.

22. The method according to any one of claims 12 to 21, wherein the enhancement network follows a U-net design.

23. A neural network training apparatus, comprising: a memory storing computer program instructions; and a processor coupled to the memory, and when executing the computer program instructions, configured to: obtain multiple training image pairs, wherein each pair of the multiple training image pairs comprises a high-quality image and a low-quality image, wherein the high-quality image and the low-quality image are images with different qualities depicting the same scenery; input the low-quality image into an enhancement network to generate an enhanced image; and input the high-quality image and at least two intermediate tensors into a degradation network to generate a degraded image, wherein the at least two intermediate tensors are determined based on the enhancement network; determine a first loss function based on the high-quality image and the enhanced image; and determine a second loss function based on the low-quality image and the degraded image; and update the neural network based on the first loss function and the second loss function, wherein the neural network comprises the enhancement network and the degradation network.

24. The neural network training apparatus according to claim 23, wherein the enhancement network comprises a first encoder and a first decoder; the at least two intermediate tensors comprise a first intermediate tensor and a second intermediate tensor, wherein the first intermediate tensor is determined based on a bottleneck tensor in the first encoder and the second intermediate tensor is determined based on at least one of network layers in the first decoder.

25. The neural network training apparatus according to claim 24, wherein the degradation network comprises a first subnetwork, when executing the computer program instructions, the processor is configured to: input the first intermediate tensor and the second intermediate tensor into the first subnetwork to generate a per-pixel kernel map, wherein convolution weights of multiple network layers of the first subnetwork are determined based on the first intermediate tensor; input the high-quality image into the degradation network; and convolve the high-quality image with per-pixel kernels to generate the degraded image, wherein the per-pixel kernels are determined from the per-pixel kernel map.

26. The neural network training apparatus according to claim 24 or 25, wherein a height x width size of the first intermediate tensor is 1x1 or 2x2 or 3x3; or a height x width size of the first intermediate tensor is the same as a height x width size of the bottleneck tensor, while a channel number of the first intermediate tensor is 1/8 or 1/4 or 1/2 of a channel number of the bottleneck tensor.

27. The neural network training apparatus according to any one of claims 24 to 26, wherein when executing the computer program instructions, the processor is further configured to inject a processed result of the first intermediate tensor into at least one network layer in the first decoder, wherein the processed result of the first intermediate tensor has a spatial size of 1 x 1.

28. The neural network training apparatus according to claim 25, wherein the first subnetwork comprises a pointwise convolution weights estimation subnetwork and a kernel map estimation subnetwork, when executing the computer program instructions, the processor is configured to: input the first intermediate tensor into the pointwise convolution weights estimation subnetwork to generate multiple dynamic pointwise convolution weights; input the second intermediate tensor into the first subnetwork and determine a local component tensor based on the second intermediate tensor; and input the local component tensor and the multiple dynamic pointwise convolution weights into the kernel map estimation subnetwork to generate the per-pixel kernel map, wherein convolution weights of multiple network layers of the kernel map estimation subnetwork are the multiple dynamic pointwise convolution weights.

29. The neural network training apparatus according to claim 28, wherein the first subnetwork further comprises a kernel embedding map estimation subnetwork, and when executing the computer program instructions, the processor is configured to input the second intermediate tensor into the kernel embedding map estimation subnetwork to generate the local component tensor.

30. The neural network training apparatus according to claim 28, wherein the enhancement network further comprises a kernel embedding map estimation subnetwork, the local component tensor is the second intermediate tensor, and when executing the computer program instructions, the processor is further configured to: input at least one of network layer tensors in the first decoder into the kernel embedding map estimation subnetwork to generate the second intermediate tensor; and concatenate the second intermediate tensor with one of network layer tensors in the first decoder.

31. The neural network training apparatus according to any one of claims 28 to 30, wherein a height x width size of the local component tensor is the same as a height x width size of the high-quality image and a channel number of the local component tensor is 4 or 8 or 16 or 24 or 32.

32. The neural network training apparatus according to any one of claims 24 to 31, wherein skip connections are provided between the first encoder and the first decoder.

33. The neural network training apparatus according to any one of claims 23 to 32, wherein the enhancement network follows a U-net design.

34. An image processing apparatus, comprising: a memory storing computer program instructions; and a processor coupled to the memory, and when executing the computer program instructions, configured to: obtain a to-be-processed image; process the to-be-processed image with an enhancement network, wherein the enhancement network is obtained by training a neural network comprising the enhancement network and a degradation network based on multiple training image pairs, wherein each pair of the multiple training image pairs comprises a high-quality image and a low-quality image; the neural network is trained based on a first loss function and a second loss function, wherein the first loss function is determined based on the high-quality image and an enhanced image, the second loss function is determined based on the low-quality image and a degraded image; the enhanced image is generated by the enhancement network with an input of the low-quality image, the degraded image is generated by the degradation network with inputs of the high- quality image and at least two intermediate tensors, wherein the at least two intermediate tensors are determined based on the enhancement network.

35. The image processing apparatus according to claim 34, wherein the enhancement network comprises a first encoder and a first decoder; the at least two intermediate tensors comprise a first intermediate tensor and a second intermediate tensor, wherein the first intermediate tensor is determined based on a bottleneck tensor in the first encoder and the second intermediate tensor is determined based on at least one of network layers in the first decoder.

36. The image processing apparatus according to claim 35, wherein the degradation network comprises a first subnetwork, the degraded image is obtained by convolving the high- quality image with per-pixel kernels, the per-pixel kernels are determined from a per-pixel kernel map, which is generated by the first subnetwork with inputs of the first intermediate tensor and the second intermediate tensor, wherein convolution weights of multiple network layers of the first subnetwork are determined based on the first intermediate tensor.

37. The image processing apparatus according to claim 35 or 36, wherein a height x width size of the first intermediate tensor is 1x1 or 2x2 or 3x3; or a height x width size of the first intermediate tensor is the same as a height x width size of the bottleneck tensor, while a channel number of the first intermediate tensor is 1/8 or 1/4 or 1/2 of a channel number of the bottleneck tensor.

38. The image processing apparatus according to any one of claims 35 to 37, wherein a processed result of the first intermediate tensor is further injected into at least one network layer in the first decoder, wherein the processed result of the first intermediate tensor has a spatial size of 1 x 1.

39. The image processing apparatus according to claim 36, wherein the first subnetwork comprises a pointwise convolution weights estimation subnetwork and a kernel map estimation subnetwork, the per-pixel kernel map is generated by the kernel map estimation subnetwork with inputs of a local component tensor and multiple dynamic pointwise convolution weights, wherein convolution weights of multiple network layers of the kernel map estimation subnetwork are the multiple dynamic pointwise convolution weights; the multiple dynamic pointwise convolution weights are generated by the pointwise convolution weights estimation subnetwork with an input of the first intermediate tensor, the local component tensor is determined based on the second intermediate tensor.

40. The image processing apparatus according to claim 39, wherein the first subnetwork further comprises a kernel embedding map estimation subnetwork, and the local component tensor is determined by the kernel embedding map estimation subnetwork with an input of the second intermediate tensor.

41. The image processing apparatus according to claim 39, wherein the enhancement network further comprises a kernel embedding map estimation subnetwork, the local component tensor is the second intermediate tensor, the second intermediate tensor is generated by the kernel embedding map estimation subnetwork with an input of at least one of network layer tensors in the first decoder; and the second intermediate tensor is concatenated with one of network layer tensors in the first decoder.

42. The image processing apparatus according to any one of claims 39 to 41, wherein a height x width size of the local component tensor is the same as a height x width size of the high-quality image and a channel number of the local component tensor is 4 or 8 or 16 or 24 or 32.

43. The image processing apparatus according to any one of claims 35 to 42, wherein skip connections are provided between the first encoder and the first decoder.

44. The image processing apparatus according to any one of claims 34 to 43, wherein the enhancement network follows a U-net design.

45. A computer-readable storage medium, wherein the computer-readable storage medium stores instructions, and when the instructions run on a device, the device is enabled to perform the method according to any one of claims 1 to 11 and 12 to 22.

46. A computer program product, wherein when the computer program product runs on a device, the device is enabled to perform the method according to any one of claims 1 to 11 and 12 to 22.

47. A chip system, comprising a memory and a processor, wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program from the memory and run the computer program, so that a device on which the chip system is disposed performs the method according to any one of claims 1 to 11 and 12 to 22.

Description:
NEURAL NETWORK TRAINING METHOD AND APPARATUS, IMAGE PROCESSING METHOD AND APPARATUS

TECHNICAL FIELD

[0001] Embodiments of the present application relate to the field of artificial intelligence, and more specifically, to a neural network training method and apparatus, an image processing method and apparatus.

BACKGROUND

[0002] With the development of terminals, taking photos with terminals, especially handheld terminals equipped with cameras, has become ubiquitous. However, the quality of the captured photos, and detail quality in particular, is often affected by various degradation sources, among which low-quality camera optics and motion blur are the most prominent in real life. The task of image enhancement is therefore one of the most important in the image processing field, especially for images taken by the cameras of handheld terminals such as cell phones.

[0003] Image enhancement methods include single image super resolution (SISR) and single image deblurring. The former is aimed at increasing image resolution and, at the same time, restoring image details smeared by the point spread function (PSF) of the camera lens. The latter is aimed at restoring image details blurred due to camera motion and/or scene object motion.

[0004] Most image enhancement methods are based on convolutional neural networks (CNNs). However, existing image enhancement methods either do not involve any information about the degradation process or assume that degradation kernels belong to a family of anisotropic Gaussian blur kernels or convolutions of classic downscaling filters, which is not consistent with the complex degradation kernels encountered in the real world.

SUMMARY

[0005] Embodiments of this application provide a neural network training method and apparatus, an image processing method and apparatus, which allow simultaneous learning of both image enhancement and image degradation, improving the quality of an output image.

[0006] According to a first aspect, an embodiment of this application provides a neural network training method, including: obtaining multiple training image pairs, where each pair of the multiple training image pairs includes a high-quality image and a low-quality image, where the high-quality image and the low-quality image are images with different qualities depicting the same scenery; inputting the low-quality image into an enhancement network to generate an enhanced image; inputting the high-quality image and at least two intermediate tensors into a degradation network to generate a degraded image, where the at least two intermediate tensors are determined based on the enhancement network; determining a first loss function based on the high-quality image and the enhanced image; determining a second loss function based on the low-quality image and the degraded image; and updating the neural network based on the first loss function and the second loss function, where the neural network includes the enhancement network and the degradation network.

[0007] According to the above-mentioned technical solution, the training process involves simultaneous learning with both the enhancement loss and the degradation loss. Thus, information about the degradation process is extracted from the low-quality image with the assistance of the degradation network, which improves the quality of the output image when the trained enhancement network is used to process images.
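
As a purely illustrative sketch of this dual-loss training scheme (not the claimed method itself), the joint update could look as follows in PyTorch. Single convolution layers stand in for the enhancement and degradation networks, an 8-channel dummy tensor stands in for the intermediate tensors passed between them, and L1 is an assumed choice of loss.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins; the real architectures are described in the embodiments below.
enhance_net = torch.nn.Conv2d(3, 3, 3, padding=1)      # stands in for the enhancement network
degrade_net = torch.nn.Conv2d(3 + 8, 3, 3, padding=1)  # stands in for the degradation network

params = list(enhance_net.parameters()) + list(degrade_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

def train_step(low_q, high_q):
    # Enhancement branch: low-quality image -> enhanced image.
    enhanced = enhance_net(low_q)

    # In the described method the degradation network also receives intermediate
    # tensors taken from the enhancement network; a dummy 8-channel tensor stands
    # in for that conditioning information here.
    intermediate = torch.zeros(low_q.size(0), 8, *low_q.shape[-2:])
    degraded = degrade_net(torch.cat([high_q, intermediate], dim=1))

    # First loss: enhanced vs. high-quality; second loss: degraded vs. low-quality.
    loss1 = F.l1_loss(enhanced, high_q)
    loss2 = F.l1_loss(degraded, low_q)
    loss = loss1 + loss2      # joint update of both networks

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Random tensors stand in for one training image pair.
lq = torch.rand(1, 3, 64, 64)
hq = torch.rand(1, 3, 64, 64)
train_step(lq, hq)
```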

[0008] In one optional implementation, the enhancement network includes a first encoder and a first decoder; the at least two intermediate tensors include a first intermediate tensor and a second intermediate tensor, where the first intermediate tensor is determined based on a bottleneck tensor in the first encoder and the second intermediate tensor is determined based on at least one of network layers in the first decoder.

[0009] According to the above-mentioned technical solution, the first intermediate tensor is extracted from the bottleneck tensor in the first encoder of the enhancement network and thus contains global information about the degradation of the low-quality image. The second intermediate tensor is extracted from one or more network layers in the first decoder and can be processed into the local component tensor, which contains local information about the degradation of the low-quality image. By inputting this information into the degradation network, the degradation process is better captured, thereby improving the quality of images produced by the enhancement network after training.
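
As an illustration of where the two intermediate tensors could be taken from, the toy encoder-decoder below is a placeholder (not the claimed architecture): a pooled bottleneck tensor serves as the global component and a decoder-layer tensor as the local component; the layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEnhancementNet(nn.Module):
    """Toy encoder-decoder that also returns the two intermediate tensors
    referred to in the text (purely illustrative)."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
        self.bottleneck = nn.Conv2d(16, 32, 3, stride=2, padding=1)
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU())
        self.dec2 = nn.ConvTranspose2d(16, 3, 2, stride=2)

    def forward(self, x):
        b = self.bottleneck(self.enc(x))       # bottleneck tensor (encoder)
        d = self.dec1(b)                       # a decoder-layer tensor
        out = self.dec2(d)                     # enhanced image
        # First intermediate tensor: pooled bottleneck (global, small spatial size).
        t1 = F.adaptive_avg_pool2d(b, 1)       # shape (N, 32, 1, 1)
        # Second intermediate tensor: taken from a decoder layer (local information).
        t2 = d
        return out, t1, t2

net = ToyEnhancementNet()
enhanced, t1, t2 = net(torch.rand(1, 3, 64, 64))
print(t1.shape, t2.shape)   # torch.Size([1, 32, 1, 1]) torch.Size([1, 16, 32, 32])
```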

[0010] The enhancement network can be any of many CNN-based networks that include an encoder and a decoder (the first encoder and the first decoder in this application), which allows high flexibility with respect to target hardware and software requirements.

[0011] In one optional implementation, the degradation network includes a first subnetwork, the inputting the high-quality image and at least two intermediate tensors into a degradation network to generate a degraded image includes: inputting the first intermediate tensor and the second intermediate tensor into the first subnetwork to generate a per-pixel kernel map, where convolution weights of multiple network layers of the first subnetwork are determined based on the first intermediate tensor; inputting the high-quality image into the degradation network; and convolving the high-quality image with per-pixel kernels to generate the degraded image, where the per-pixel kernels are determined from the per-pixel kernel map.

[0012] According to the above-mentioned technical solution, the per-pixel kernel map is estimated by combining the global information and the local information about the degradation process. It is a spatially variant degradation kernel map estimated by the degradation network. Decoupling the global and local information about the degradation process dramatically decreases the internal dimensionality of the task of estimating the spatially variant degradation kernel map (the per-pixel kernel map), thereby greatly mitigating its ill-posedness and making the per-pixel kernel map more consistent.
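
A hedged sketch of such a spatially variant convolution is shown below: each pixel of the high-quality image is convolved with its own k x k kernel taken from a per-pixel kernel map. The 5x5 kernel size and the softmax-normalized random kernel map are illustrative assumptions, not the specific formulation of the application.

```python
import torch
import torch.nn.functional as F

def apply_per_pixel_kernels(image, kernel_map, k=5):
    """Convolve each pixel of `image` with its own k x k kernel.

    image:      (N, C, H, W) high-quality image
    kernel_map: (N, k*k, H, W) spatially variant kernels, one per pixel
    """
    n, c, h, w = image.shape
    # Unfold extracts a k x k neighbourhood around every pixel: (N, C*k*k, H*W).
    patches = F.unfold(image, kernel_size=k, padding=k // 2)
    patches = patches.view(n, c, k * k, h, w)
    kernels = kernel_map.view(n, 1, k * k, h, w)
    # Weighted sum of each neighbourhood with its per-pixel kernel.
    return (patches * kernels).sum(dim=2)        # (N, C, H, W)

hq = torch.rand(1, 3, 32, 32)
# Normalized random kernels stand in for the estimated per-pixel kernel map.
km = torch.softmax(torch.rand(1, 25, 32, 32), dim=1)
degraded = apply_per_pixel_kernels(hq, km, k=5)
print(degraded.shape)   # torch.Size([1, 3, 32, 32])
```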

[0013] In one optional implementation, a height x width size of the first intermediate tensor is 1x1 or 2x2 or 3x3; or a height x width size of the first intermediate tensor is the same as a height x width size of the bottleneck tensor, while a channel number of the first intermediate tensor is 1/8 or 1/4 or 1/2 of a channel number of the bottleneck tensor.

[0014] According to the above-mentioned technical solution, the first intermediate tensor is either directly extracted based on part of the bottleneck tensor or processed into a tensor with a high dimension but a relatively small size.

[0015] In one optional implementation, the method further includes: injecting a processed result of the first intermediate tensor into at least one network layer in the first decoder, where the processed result of the first intermediate tensor has a spatial size of 1 x 1.

[0016] According to the above-mentioned technical solution, the first intermediate tensor containing global information about the degradation process is processed so that it can be injected into the first decoder of the enhancement network, which does not require much computation. The first intermediate tensor (the global component tensor) incorporates information about image degradation that is specific to the entire image (e.g., information about the camera lens, camera motion, etc.). If the first decoder of the enhancement network is supplied with this information, it can better restore the details of a high-quality image. On the other hand, since the global component tensor has a small spatial size (i.e., height x width), its estimation (and the injection of its compressed variant) does not require much computation. Thus, the enhanced image quality is improved at the cost of a moderate increase in computational complexity.
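
For illustration only, one simple way such an injection could be realized is a 1x1 projection of the global component tensor to the decoder's channel count, followed by broadcast addition over the spatial grid. The channel sizes and the additive scheme below are assumptions, not the specific mechanism of the application.

```python
import torch
import torch.nn as nn

global_t = torch.rand(1, 32, 1, 1)        # processed first intermediate tensor, spatial size 1x1
decoder_feat = torch.rand(1, 64, 32, 32)  # a decoder-layer feature map

proj = nn.Conv2d(32, 64, kernel_size=1)   # cheap 1x1 projection to the decoder's channel count
injected = decoder_feat + proj(global_t)  # broadcasts over the 32x32 spatial grid
print(injected.shape)                     # torch.Size([1, 64, 32, 32])
```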

[0017] In one optional implementation, the first subnetwork includes a pointwise convolution weights estimation subnetwork and a kernel map estimation subnetwork, the inputting the first intermediate tensor and the second intermediate tensor into the first subnetwork to generate a per-pixel kernel map includes: inputting the first intermediate tensor into the pointwise convolution weights estimation subnetwork to generate multiple dynamic pointwise convolution weights; inputting the second intermediate tensor into the first subnetwork and determining a local component tensor based on the second intermediate tensor; and inputting the local component tensor and the multiple dynamic pointwise convolution weights into the kernel map estimation subnetwork to generate the per-pixel kernel map, where convolution weights of multiple network layers of the kernel map estimation subnetwork are the multiple dynamic pointwise convolution weights.

[0018] The kernel map estimation subnetwork is structured as an MLP consisting of several pointwise convolutions, which allows better quality of the output enhanced image after training due to the richer space of estimated degradation processes.

[0019] The pointwise convolution weights estimation subnetwork generates the weights of a dynamic convolution, which can improve the quality of the output enhanced image after training due to better adaptation to each particular input image.

[0020] In one optional implementation, the first subnetwork further includes a kernel embedding map estimation subnetwork, and the inputting the second intermediate tensor into the first subnetwork and determining a local component tensor based on the second intermediate tensor includes: inputting the second intermediate tensor into the kernel embedding map estimation subnetwork to generate the local component tensor.

[0021] According to the above-mentioned technical solution, the degradation network includes three subnetworks: a pointwise convolution weights estimation subnetwork, a kernel map estimation subnetwork, and a kernel embedding map estimation subnetwork. The first intermediate tensor is input into the pointwise convolution weights estimation subnetwork to generate multiple dynamic pointwise convolution weights. The second intermediate tensor is input into the kernel embedding map estimation subnetwork to generate the local component tensor. The local component tensor is then input into the kernel map estimation subnetwork to generate the per-pixel kernel map, where the dynamic pointwise convolution weights are also input into the kernel map estimation subnetwork to be used as its convolution weights. The per-pixel kernel map is then processed into per-pixel kernels, which are convolved with the high-quality image to generate the degraded image. The degradation information learned from the degradation process improves the performance of the enhancement network.
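
The sketch below illustrates the idea of pointwise (1x1) convolutions whose weights are produced per input image, i.e., a dynamically parameterized MLP over the local component tensor (here for a batch of one). The number of layers, channel counts, and the 5x5 kernel output size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dynamic_pointwise_mlp(local_t, dyn_weights):
    """Kernel map estimation as a stack of 1x1 convolutions whose weights are
    generated per input (a sketch of dynamic pointwise convolution).

    local_t:     (1, C_in, H, W) local component tensor
    dyn_weights: list of weight tensors, each (C_out, C_in, 1, 1)
    """
    x = local_t
    for w in dyn_weights:
        x = F.relu(F.conv2d(x, w))   # pointwise conv with dynamically generated weights
    return x                         # per-pixel kernel map, (1, k*k, H, W)

# Random stand-ins for weights that the pointwise convolution weights estimation
# subnetwork would produce from the global component tensor.
weights = [torch.rand(32, 16, 1, 1), torch.rand(25, 32, 1, 1)]   # final 25 = 5x5 kernel entries
local_component = torch.rand(1, 16, 64, 64)
kernel_map = dynamic_pointwise_mlp(local_component, weights)
print(kernel_map.shape)   # torch.Size([1, 25, 64, 64])
```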

[0022] In one optional implementation, the enhancement network further includes a kernel embedding map estimation subnetwork, the local component tensor is the second intermediate tensor, and the method further includes: inputting at least one of network layer tensors in the first decoder into the kernel embedding map estimation subnetwork to generate the second intermediate tensor; and concatenating the second intermediate tensor with one of network layer tensors in the first decoder.

[0023] According to the above-mentioned technical solution, the kernel embedding map estimation subnetwork is included in the enhancement network, and the second intermediate tensor input into the degradation network is the local component tensor used in the kernel map estimation subnetwork. The kernel embedding map estimation subnetwork can be relatively lightweight, which reduces the computational cost of the training process.

[0024] In one optional implementation, a height x width size of the local component tensor is the same as that of the high-quality image and a channel number of the local component tensor is 4 or 8 or 16 or 24 or 32.

[0025] In one optional implementation, skip connections are provided between the first encoder and the first decoder.

[0026] According to the above-mentioned technical solution, skip connections are provided between the first encoder and the first decoder to alleviate the degradation of the network and problems of vanishing and exploding gradients, improving the performance of the neural network.

[0027] In one optional implementation, the enhancement network follows a U-net design.

[0028] According to a second aspect, an embodiment of this application provides an image processing method, including: obtaining a to-be-processed image; processing the to-be- processed image with an enhancement network, where the enhancement network is obtained by training a neural network including the enhancement network and a degradation network based on multiple training image pairs, where each pair of the multiple training image pairs includes a high-quality image and a low-quality image. The neural network is trained based on a first loss function and a second loss function, where the first loss function is determined based on the high-quality image and an enhanced image, and the second loss function is determined based on the low-quality image and a degraded image. The enhanced image is generated by the enhancement network with an input of the low-quality image, the degraded image is generated by the degradation network with inputs of the high-quality image and at least two intermediate tensors, where the at least two intermediate tensors are determined based on the enhancement network.
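
For illustration, an inference-time sketch is given below under the assumption that a trained enhancement network is available; the degradation network plays no role at this stage. A single convolution stands in for the trained enhancement network.

```python
import torch
import torch.nn as nn

# After training, only the enhancement network is retained; the degradation
# network is discarded. A single conv layer stands in for the trained network here.
enhance_net = nn.Conv2d(3, 3, 3, padding=1)
enhance_net.eval()

to_be_processed = torch.rand(1, 3, 64, 64)    # stands in for a captured low-quality image
with torch.no_grad():
    enhanced = enhance_net(to_be_processed)
print(enhanced.shape)                         # torch.Size([1, 3, 64, 64])
```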

[0029] In one optional implementation, the enhancement network includes a first encoder and a first decoder; the at least two intermediate tensors include a first intermediate tensor and a second intermediate tensor, where the first intermediate tensor is determined based on a bottleneck tensor in the first encoder and the second intermediate tensor is determined based on at least one of network layers in the first decoder.

[0030] In one optional implementation, the degradation network includes a first subnetwork, the degraded image is obtained by convolving the high-quality image with per-pixel kernels, the per-pixel kernels are determined from the per-pixel kernel map, which is generated by the first subnetwork with inputs of the first intermediate tensor and the second intermediate tensor, where convolution weights of multiple network layers of the first subnetwork are determined based on the first intermediate tensor.

[0031] In one optional implementation, a height x width size of the first intermediate tensor is 1 x1 or 2x2 or 3x3; or a height x width size of the first intermediate tensor is the same as that of the bottleneck tensor, while a channel number of the first intermediate tensor is 1/8 or 1/4 or 1/2 of a channel number of the bottleneck tensor.

[0032] In one optional implementation, a processed result of the first intermediate tensor is further injected into at least one network layer in the first decoder, where the processed result of the first intermediate tensor has a spatial size of 1 x 1.

[0033] In one optional implementation, the first subnetwork includes a pointwise convolution weights estimation subnetwork and a kernel map estimation subnetwork, the per- pixel kernel map is generated by the kernel map estimation subnetwork with inputs of a local component tensor and multiple dynamic pointwise convolution weights, where convolution weights of multiple network layers of the kernel map estimation subnetwork are the multiple dynamic pointwise convolution weights; the multiple dynamic pointwise convolution weights are generated by the pointwise convolution weights estimation subnetwork with an input of the first intermediate tensor, and the local component tensor is determined based on the second intermediate tensor.

[0034] In one optional implementation, the first subnetwork further includes a kernel embedding map estimation subnetwork, the local component tensor is determined by the kernel embedding map estimation subnetwork with an input of the second intermediate tensor.

[0035] In one optional implementation, the enhancement network further includes a kernel embedding map estimation subnetwork, the local component tensor is the second intermediate tensor, the second intermediate tensor is generated by the kernel embedding map estimation subnetwork with an input of at least one of network layer tensors in the first decoder; and the second intermediate tensor is concatenated with one of network layer tensors in the first decoder.

[0036] In one optional implementation, a height x width size of the local component tensor is the same as that of the high-quality image and a channel number of the local component tensor is 4 or 8 or 16 or 24 or 32.

[0037] In one optional implementation, skip connections are provided between the first encoder and the first decoder.

[0038] In one optional implementation, the enhancement network follows a U-net design.

[0039] According to a third aspect, an embodiment of this application provides a neural network training apparatus, including: a memory storing computer program instructions; and a processor coupled to the memory, and when executing the computer program instructions, configured to: obtain multiple training image pairs, where each pair of the multiple training image pairs includes a high-quality image and a low-quality image, where the high-quality image and the low-quality image are images with different qualities depicting the same scenery; input the low-quality image into an enhancement network to generate an enhanced image; input the high-quality image and at least two intermediate tensors into a degradation network to generate a degraded image, where the at least two intermediate tensors are determined based on the enhancement network; determine a first loss function based on the high-quality image and the enhanced image; determine a second loss function based on the low-quality image and the degraded image; and update the neural network based on the first loss function and the second loss function, where the neural network includes the enhancement network and the degradation network.

[0040] In one optional implementation, the enhancement network includes a first encoder and a first decoder; the at least two intermediate tensors include a first intermediate tensor and a second intermediate tensor, where the first intermediate tensor is determined based on a bottleneck tensor in the first encoder and the second intermediate tensor is determined based on at least one of network layers in the first decoder.

[0041] In one optional implementation, the degradation network includes a first subnetwork, when executing the computer program instructions, the processor is specifically configured to: input the first intermediate tensor and the second intermediate tensor into the first subnetwork to generate a per-pixel kernel map, where convolution weights of multiple network layers of the first subnetwork are determined based on the first intermediate tensor; input the high-quality image into the degradation network; and convolve the high-quality image with per-pixel kernels to generate the degraded image, where the per-pixel kernels are determined from the per-pixel kernel map.

[0042] In one optional implementation, a height x width size of the first intermediate tensor is 1 x 1 or 2x2 or 3x3; or a height x width size of the first intermediate tensor is the same as that of the bottleneck tensor, while a channel number of the first intermediate tensor is 1/8 or 1/4 or 1/2 of a channel number of the bottleneck tensor.

[0043] In one optional implementation, when executing the computer program instructions, the processor is further configured to inject a processed result of the first intermediate tensor into at least one network layer in the first decoder, where the processed result of the first intermediate tensor has a spatial size of 1 x 1.

[0044] In one optional implementation, the first subnetwork includes a pointwise convolution weights estimation subnetwork and a kernel map estimation subnetwork, when executing the computer program instructions, the processor is specifically configured to: input the first intermediate tensor into the pointwise convolution weights estimation subnetwork to generate multiple dynamic pointwise convolution weights; input the second intermediate tensor into the first subnetwork and determine a local component tensor based on the second intermediate tensor; and input the local component tensor and the multiple dynamic pointwise convolution weights into the kernel map estimation subnetwork to generate the per-pixel kernel map, where convolution weights of multiple network layers of the kernel map estimation subnetwork are the multiple dynamic pointwise convolution weights.

[0045] In one optional implementation, the first subnetwork further includes a kernel embedding map estimation subnetwork, when executing the computer program instructions, the processor is specifically configured to input the second intermediate tensor into the kernel embedding map estimation subnetwork to generate the local component tensor.

[0046] In one optional implementation, the enhancement network further includes a kernel embedding map estimation subnetwork, the local component tensor is the second intermediate tensor, when executing the computer program instructions, the processor is further configured to: input at least one of network layer tensors in the first decoder into the kernel embedding map estimation subnetwork to generate the second intermediate tensor; and concatenate the second intermediate tensor with one of network layer tensors in the first decoder.

[0047] In one optional implementation, a height x width size of the local component tensor is the same as that of the high-quality image and a channel number of the local component tensor is 4 or 8 or 16 or 24 or 32.

[0048] In one optional implementation, skip connections are provided between the first encoder and the first decoder.

[0049] In one optional implementation, the enhancement network follows a U-net design.

[0050] According to a fourth aspect, an embodiment of this application provides an image processing apparatus, including: a memory storing computer program instructions; and a processor coupled to the memory, and when executing the computer program instructions, configured to: obtain a to-be-processed image; process the to-be-processed image with an enhancement network, where the enhancement network is obtained by training a neural network including the enhancement network and a degradation network based on multiple training image pairs, where each pair of the multiple training image pairs includes a high-quality image and a low-quality image. The neural network is trained based on a first loss function and a second loss function, where the first loss function is determined based on the high-quality image and an enhanced image, the second loss function is determined based on the low-quality image and a degraded image. The enhanced image is generated by the enhancement network with an input of the low-quality image, and the degraded image is generated by the degradation network with inputs of the high-quality image and at least two intermediate tensors, where the at least two intermediate tensors are determined based on the enhancement network.

[0051] In one optional implementation, the enhancement network includes a first encoder and a first decoder; the at least two intermediate tensors include a first intermediate tensor and a second intermediate tensor, where the first intermediate tensor is determined based on a bottleneck tensor in the first encoder and the second intermediate tensor is determined based on at least one of network layers in the first decoder.

[0052] In one optional implementation, the degradation network includes a first subnetwork, the degraded image is obtained by convolving the high-quality image with per-pixel kernels, the per-pixel kernels are determined from a per-pixel kernel map, which is generated by the first subnetwork with inputs of the first intermediate tensor and the second intermediate tensor, where convolution weights of multiple network layers of the first subnetwork are determined based on the first intermediate tensor.

[0053] In one optional implementation, a height x width size of the first intermediate tensor is 1 x 1 or 2x2 or 3x3; or a height x width size of the first intermediate tensor is the same as that of the bottleneck tensor, while a channel number of the first intermediate tensor is 1/8 or 1/4 or 1/2 of a channel number of the bottleneck tensor.

[0054] In one optional implementation, a processed result of the first intermediate tensor is further injected into at least one network layer in the first decoder, where the processed result of the first intermediate tensor has a spatial size of 1 x 1.

[0055] In one optional implementation, the first subnetwork includes a pointwise convolution weights estimation subnetwork and a kernel map estimation subnetwork, the per- pixel kernel map is generated by the kernel map estimation subnetwork with inputs of a local component tensor and multiple dynamic pointwise convolution weights, where convolution weights of multiple network layers of the kernel map estimation subnetwork are the multiple dynamic pointwise convolution weights; the multiple dynamic pointwise convolution weights are generated by the pointwise convolution weights estimation subnetwork with an input of the first intermediate tensor, and the local component tensor is determined based on the second intermediate tensor.

[0056] In one optional implementation, the first subnetwork further includes a kernel embedding map estimation subnetwork, the local component tensor is determined by the kernel embedding map estimation subnetwork with an input of the second intermediate tensor.

[0057] In one optional implementation, the enhancement network further includes a kernel embedding map estimation subnetwork, the local component tensor is the second intermediate tensor, the second intermediate tensor is generated by the kernel embedding map estimation subnetwork with an input of at least one of network layer tensors in the first decoder; the second intermediate tensor is concatenated with one of network layer tensors in the first decoder.

[0058] In one optional implementation, a height x width size of the local component tensor is the same as that of the high-quality image and a channel number of the local component tensor is 4 or 8 or 16 or 24 or 32.

[0059] In one optional implementation, skip connections are provided between the first encoder and the first decoder.

[0060] In one optional implementation, the enhancement network follows a U-net design.

[0061] According to a fifth aspect, an embodiment of this application provides a computer-readable storage medium, including instructions. When the instructions run on a computer, the computer is enabled to perform the method in the first aspect or any optional implementation of the first aspect and the method in the second aspect or any optional implementation of the second aspect.

[0062] According to a sixth aspect, an electronic device is provided, including a processor and a memory. The processor is connected to the memory. The memory is configured to store instructions and the processor is configured to execute the instructions. When the processor executes the instructions stored in the memory, the processor is enabled to perform the method in the first aspect or any optional implementation of the first aspect and the method in the second aspect or any optional implementation of the second aspect.

[0063] According to a seventh aspect, a chip system is provided, where the chip system includes a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to invoke the computer program from the memory and run the computer program, so that an electronic device on which the chip system is disposed performs the method in the first aspect or any optional implementation of the first aspect and the method in the second aspect or any optional implementation of the second aspect.

[0064] According to an eighth aspect, a computer program product is provided, where when the computer program product runs on an electronic device, the electronic device is enabled to perform the method in the first aspect or any optional implementation of the first aspect and the method in the second aspect or any optional implementation of the second aspect.

DESCRIPTION OF DRAWINGS

[0065] FIG.1 is a schematic diagram of a structure of a system architecture according to an embodiment of this application.

[0066] FIG. 2 is a schematic diagram of a convolutional neural network.

[0067] FIG. 3 is a schematic diagram of a hardware structure of a chip according to an embodiment of this application.

[0068] FIG. 4 is a schematic flowchart of a neural network training method 400 according to an embodiment of this application.

[0069] FIG. 5 is a schematic flowchart of an image processing method 500 according to an embodiment of this application.

[0070] FIG. 6 is a schematic diagram of a neural network of the training process used in the method 400 according to an embodiment of this application.

[0071] FIG. 7 is the network of the inference process corresponding to the neural network of the training process in FIG. 6.

[0072] FIG. 8 is a schematic diagram of another neural network of the training process used in method 400 according to an embodiment of this application.

[0073] FIG. 9 is the network of the inference process corresponding to the neural network of the training process in FIG. 8.

[0074] FIG. 10 is a schematic diagram of another neural network of the training process used in method 400 according to an embodiment of this application.

[0075] FIG. 11 is the network of the inference process corresponding to the neural network of the training process in FIG. 10.

[0076] FIG. 12 is a schematic block diagram of a neural network training apparatus 1200.

[0077] FIG. 13 is a schematic block diagram of an image processing apparatus 1300.

[0078] FIG. 14 is a schematic diagram of a hardware structure of a neural network training apparatus 1400.

[0079] FIG. 15 is a schematic diagram of a hardware structure of an image processing apparatus 1500.

DESCRIPTION OF EMBODIMENTS

[0080] The following describes the technical solutions in this application with reference to the accompanying drawings.

[0081] Definitions:

[0082] Unless otherwise stated or implicit from context, the following terms and phrases have the meanings provided below.

[0083] The embodiments of this application relate to the widespread application of convolutional neural networks. For ease of understanding, the following first describes related concepts, such as related terms and the convolutional neural network, in the embodiments of this application.

[0084] (1) Neural Network

[0085] The neural network may include a neuron, and the neuron may be an operation unit whose inputs are x_s and whose intercept is 1. The output of the operation unit may be shown as formula (1):

h_{W,b}(x) = f(W^T x + b) = f(sum_{s=1}^{n} W_s x_s + b)    (1)

[0086] Herein, s = 1, 2, ..., n, n is a natural number greater than 1, W_s represents a weight of x_s, and b represents a bias of the neuron. f represents an activation function of the neuron, where the activation function is used to introduce a nonlinear characteristic into the neural network, to convert an input signal in the neuron into an output signal. The output signal of the activation function may be used as an input of a next convolutional layer, and the activation function may be a sigmoid function. The neural network is a network constituted by connecting a plurality of single neurons together. To be specific, an output of one neuron may be an input of another neuron. An input of each neuron may be connected to a local receptive field of a previous layer, to extract a feature of the local receptive field. The local receptive field may be a region including several neurons.
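
A small numeric illustration of formula (1) with n = 3 and a sigmoid activation (the concrete numbers are arbitrary):

```python
import math

# Worked example of formula (1): output = f(sum_s W_s * x_s + b), f = sigmoid.
x = [0.5, -1.0, 2.0]          # inputs x_1..x_3
W = [0.2, 0.4, -0.1]          # weights W_1..W_3
b = 0.3                       # bias

z = sum(Ws * xs for Ws, xs in zip(W, x)) + b      # 0.1 - 0.4 - 0.2 + 0.3 = -0.2
output = 1.0 / (1.0 + math.exp(-z))               # sigmoid(-0.2) ~= 0.4502
print(output)
```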

[0087] (2) Deep Neural Network

[0088] The deep neural network (DNN), also referred to as a multi-layer neural network, may be understood as a neural network having a plurality of hidden layers. The DNN is divided based on positions of different layers. Neural network layers inside the DNN may be classified into three types: an input layer, a hidden layer, and an output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the middle layers are hidden layers. Layers are fully connected. To be specific, any neuron in the i-th layer is necessarily connected to any neuron in the (i+1)-th layer.

[0089] Although the DNN seems complex, the work of each layer is actually not complex, and is simply expressed by the following linear relational expression: y = α(W · x + b), where x represents an input vector, y represents an output vector, b represents a bias vector, W represents a weight matrix (which is also referred to as a coefficient), and α(·) represents an activation function. In each layer, only such a simple operation is performed on the input vector x to obtain the output vector y. Due to a large quantity of DNN layers, quantities of coefficients W and bias vectors b are also large. These parameters are defined in the DNN as follows. The coefficient W is taken as an example. It is assumed that in a three-layer DNN, a linear coefficient from a fourth neuron in a second layer to a second neuron in a third layer is defined as W^3_{24}. The superscript 3 represents the number of the layer in which the coefficient W is located, and the subscript corresponds to an index 2 of the third layer for the output and an index 4 of the second layer for the input.

[0090] In conclusion, a coefficient from a k-th neuron in an (L-1)-th layer to a j-th neuron in an L-th layer is defined as W^L_{jk}.
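For readers who prefer code, the per-layer operation y = α(W · x + b) and the coefficient indexing above can be sketched as follows. The layer sizes, random weights, and the tanh activation are illustrative assumptions only.

```python
import numpy as np

def dense_layer(x, W, b, activation=np.tanh):
    # One DNN layer: y = activation(W @ x + b)
    return activation(W @ x + b)

# A toy three-layer DNN (illustrative sizes): input layer 4 -> second layer 5 -> third layer 3
rng = np.random.default_rng(0)
x = rng.normal(size=4)
layers = [(rng.normal(size=(5, 4)), rng.normal(size=5)),
          (rng.normal(size=(3, 5)), rng.normal(size=3))]
# In this toy network, layers[1][0][1, 3] plays the role of W^3_24: the coefficient from
# the 4th neuron of the 2nd layer to the 2nd neuron of the 3rd layer (0-based indexing).
for W, b in layers:
    x = dense_layer(x, W, b)
print(x)  # output vector of the last layer
```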

[0091] It should be noted that the input layer has no parameter W. In the deep neural network, more hidden layers make the network more capable of describing a complex case in the real world. Theoretically, a model with more parameters has higher complexity and a larger “capacity”. It indicates that the model can complete a more complex learning task. Training of the deep neural network is a process of learning a weight matrix, and a final objective of the training is to obtain a weight matrix of all layers of a trained deep neural network (a weight matrix formed by vectors W of many layers).

[0092] (3) Convolutional Neural Network

[0093] A convolutional neural network (CNN) is a deep neural network with a convolutional structure. The CNN includes a feature extractor constituted by a convolutional layer and a sub-sampling layer. The feature extractor may be considered a filter. A convolution process may be considered as using a trainable filter to perform convolution on an input image or a convolutional feature plane, to output a convolutional feature plane. The convolutional feature plane may also be referred to as a feature map. The convolutional layer is a neuron layer that is in the convolutional neural network and that performs convolution processing on an input signal. At the convolutional layer in the convolutional neural network, one neuron may be connected only to some neurons of the adjacent layer. One convolutional layer usually includes several feature planes, and each feature plane may include some rectangularly arranged neurons. Neurons in a same feature plane share a weight, and a weight matrix corresponding to the weight shared herein is a convolution kernel. Weight sharing may be understood as that an image information extraction manner is irrelevant to a location. A principle implied herein is that statistical information of a part of an image is the same as that of other parts. This means that image information learned in a part can also be used in another part. Therefore, the same learned image information can be used for all locations in the image. At a same convolutional layer, a plurality of convolution kernels may be used to extract different image information. Usually, a larger quantity of convolution kernels indicates richer image information reflected by a convolution operation.

[0094] The convolution kernel may be initialized in a form of a random-size matrix. In a process of training the convolutional neural network, the convolution kernel may obtain an appropriate weight through learning. In addition, a direct benefit brought by the weight sharing is that connections between layers of the convolutional neural network are reduced and an overfitting risk is lowered.

[0095] (4) Residual Network

[0096] A residual network is a deep convolutional network proposed in 2015. Compared with a traditional convolutional neural network, the residual network is easier to optimize and can improve accuracy by considerably increasing the depth. The core of the residual network is to resolve the side effect (degeneration problem) of increasing the depth, so that the network performance can be improved by simply increasing the network depth. A residual network generally contains many sub-modules with the same structure.

[0097] (5) Pixel Value

[0098] A pixel value of an image may be a red-green-blue (RGB) color value, and the pixel value may be a long integer representing a color. For example, a pixel value is 256 × Red + 100 × Green + 76 × Blue, where Blue represents a blue component, Green represents a green component, and Red represents a red component. In each color component, a smaller value indicates lower luminance, and a larger value indicates higher luminance. For a grayscale image, a pixel value may be a grayscale value.

[0099] (6) Loss Function

[00100] In a process of training a convolutional neural network, because it is expected that an output of the convolutional neural network is maximally close to a value that is actually expected to be predicted, a current predicted value of the network and an actually desired target value may be compared, and then a weight vector of each layer of the convolutional neural network is updated based on a difference between the predicted value and the target value (certainly, there is usually an initialization process before the first update; to be specific, parameters are preconfigured for all the layers of the convolutional neural network). For example, if the predicted value of the network is large, the weight vector is adjusted to lower the predicted value, and the adjustment is continuously performed until the convolutional neural network can predict the actually desired target value or a value that is very close to the actually desired target value. Therefore, “how to obtain, through comparison, a difference between a predicted value and a target value” needs to be predefined. This is the loss function or an objective function. The loss function and the objective function are important equations used to measure the difference between the predicted value and the target value. The loss function is used as an example. A larger output value (loss) of the loss function indicates a larger difference. Therefore, training of the convolutional neural network becomes a process of minimizing the loss as much as possible.

[00101] (7) Back Propagation Algorithm

[00102] A convolutional neural network may correct a value of a parameter in the convolutional neural network in a training process by using an error back propagation (back propagation, BP) algorithm, so that an error loss between a predicted value output by the convolutional neural network and an actually desired target value is increasingly small. Specifically, an input signal is transferred forward until an error loss occurs at an output, and the parameter in the initial convolutional neural network is updated by using back propagation error loss information, to make the error loss converge. The back propagation algorithm is a back propagation motion dominated by an error loss, and is intended to obtain an optimal parameter of the convolutional neural network, for example, a weight matrix, namely, a convolution kernel of a convolutional layer.

[00103] As shown in FIG. 1, an embodiment of this application provides a system architecture 100. In FIG. 1, a data collection device 160 is configured to collect training data. For the image processing method in the embodiments of this application, the training data may include pairs of aligned low-quality and high-quality images depicting the same scenery.

[00104] After collecting the training data, the data collection device 160 stores the training data into a database 130. A training device 120 obtains a target model/rule 101 by performing training based on the training data maintained in the database 130.

[00105] The following describes the obtaining of the target model/rule 101 by the training device 120 based on the training data.

[00106] The target model/rule 101 can be used to implement the image processing method in the embodiments of this application. To be specific, after related preprocessing is performed on a to-be-processed image, a to-be-processed image obtained after related preprocessing is input to the target model/rule 101, to obtain a processed high-quality result of the image. The target model/rule 101 in this embodiment of this application may be specifically a neural network. It should be noted that, in actual application, the training data maintained in the database 130 may not all be collected by the data collection device 160, or may be received and obtained from another device. It should be further noted that the training device 120 may not necessarily train the target model/rule 101 completely based on the training data maintained in the database 130, or may alternatively obtain training data from a cloud or another place to perform model training. The foregoing description should not be construed as a limitation on the embodiments of this application.

[00107] The target model/rule 101 obtained through training by the training device 120 may be applied to different systems or devices, for example, an execution device 110 shown in FIG. 1. The execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) terminal, or a vehicle-mounted terminal, or may be a server, a cloud device, or the like. In FIG. 1, an input/output (I/O) interface 112 is configured for the execution device 110, to exchange data with an external device. A user may input data to the I/O interface 112 by using a client device 140. In this embodiment of this application, the input data may include a to-be-processed image input by the client device.

[00108] A preprocessing module 113 and a preprocessing module 114 are configured to perform preprocessing based on the input data (for example, the to-be-processed image) received by the I/O interface 112. In this embodiment of this application, there may be no preprocessing module 113 and no preprocessing module 114 (or there may be only one of the preprocessing modules), and the input data is processed directly by using a calculation module 111.

[00109] In a process in which the execution device 110 performs preprocessing on the input data or the calculation module 111 of the execution device 110 performs related processing such as calculation, the execution device 110 may invoke data, code, and the like in a data storage system 150 for corresponding processing, and may also store data, instructions, and the like obtained through corresponding processing into the data storage system 150.

[00110] Finally, the I/O interface 112 returns a processing result, for example, the obtained high-quality result of the to-be-processed image, to the client device 140, to provide the processing result to the user.

[00111] It should be noted that the training device 120 may generate corresponding target models/rules 101 based on different training data for different objectives or different tasks. The corresponding target models/rules 101 may be used to implement the foregoing objectives or complete the foregoing tasks, to provide required results to the user.

[00112] In a case shown in FIG. 1 , the user may manually provide input data by operating an interface provided by the I/O interface 112. In another case, the client device 140 may automatically send input data to the I/O interface 112. If authorization of the user needs to be obtained for requesting the client device 140 to automatically send the input data, the user may set corresponding permission on the client device 140. The user may view, on the client device 140, a result output by the execution device 110, and a specific presentation form may be a specific manner such as display, sound, or action. The client device 140 may also serve as a data collection end to collect, as new sample data, input data that is input to the I/O interface 112 and an output result that is output from the I/O interface 112 that are shown in the figure, and store the new sample data into the database 130. Certainly, the client device 140 may alternatively not perform collection, but the I/O interface 112 directly stores, as new sample data into the database 130, input data that is input to the I/O interface 112 and an output result that is output from the I/O interface 112 that are shown in the figure.

[00113] It should be noted that FIG. 1 is merely a schematic diagram of a system architecture according to an embodiment of this application. A location relationship between devices, components, modules, and the like shown in the figure does not constitute any limitation. For example, in FIG. 1, the data storage system 150 is an external memory relative to the execution device 110. In other cases, the data storage system 150 may alternatively be disposed in the execution device 110.

[00114] As shown in FIG. 1, the target model/rule 101 is obtained through training by the training device 120. In this embodiment of this application, the target model/rule 101 is the enhancement network to be discussed later, which is part of the model/rule to be trained. During the training process, the model/rule to be trained includes a degradation network and an enhancement network. The enhancement network may be a neural network in this application. Specifically, the neural network provided in the embodiments of this application may be a CNN, a deep convolutional neural network (DCNN), a recurrent neural network (RNN), or the like. The degradation network may also be a neural network in this application. Specifically, the neural network provided in the embodiments of this application may be a CNN, a deep convolutional neural network (DCNN), or the like.

[00115] As shown in FIG. 2, a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a convolutional neural network layer 230.

[00116] Convolutional Layer/Pooling Layer 220:

[00117] Convolutional Layer:

[00118] As shown in FIG. 2, the convolutional layer/pooling layer 220 may include layers 221 to 226. For example, in an implementation, a layer 221 is a convolutional layer, a layer 222 is a pooling layer, a layer 223 is a convolutional layer, a layer 224 is a pooling layer, a layer 225 is a convolutional layer, and a layer 226 is a pooling layer. In another implementation, layers 221 and 222 are convolutional layers, a layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and a layer 226 is a pooling layer. To be specific, an output of a convolutional layer may be used as an input of a subsequent pooling layer, or may be used as an input of another convolutional layer to continue to perform a convolution operation.

[00119] The following uses the convolutional layer 221 as an example to describe an internal working principle of one convolutional layer.

[00120] The convolutional layer 221 may include a plurality of convolution operators. The convolution operator is also referred to as a convolution kernel. In image processing, the convolution operator is equivalent to a filter that extracts specific information from an input image matrix. The convolution operator may essentially be a weight matrix. The weight matrix is usually predefined, and depends on a value of a stride in a process of performing a convolution operation on an image. The weight matrix usually processes pixels at a granularity level of one pixel or two pixels in a horizontal direction on the input image, to extract a specific feature from the image. A size of the weight matrix needs to be related to a size of the image. It should be noted that a depth dimension of the weight matrix is the same as that of the input image. In a convolution operation process, the weight matrix extends to an entire depth of the input image. The depth dimension is a channel dimension, and corresponds to a quantity of channels. Therefore, a convolutional output of a single depth dimension is generated through convolution with a single weight matrix. However, in most cases, a single weight matrix is not used, but a plurality of weight matrices with a same size (rows × columns), namely, a plurality of same-type matrices, are applied. Outputs of the weight matrices are stacked to form a depth dimension of a convolutional image. The dimension herein may be understood as being determined based on the foregoing “a plurality of”. Different weight matrices may be used to extract different features from an image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract a specific color of the image, and a further weight matrix is used to blur unneeded noise in the image. The plurality of weight matrices have the same size (rows × columns). Sizes of feature maps extracted by using the plurality of weight matrices with the same size are also the same, and then the plurality of extracted feature maps with the same size are combined to form an output of a convolution operation.
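The stacking of outputs from a plurality of weight matrices described above can be sketched with a standard convolution layer. The channel counts, kernel size, and input size below are illustrative assumptions only.

```python
import torch
import torch.nn as nn

# A convolutional layer applying 16 weight matrices (convolution kernels) of the same
# size to a 3-channel input; each kernel extends over the full input depth, and the
# 16 single-depth outputs are stacked along the channel (depth) dimension.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

image = torch.randn(1, 3, 64, 64)   # a batch of one 64x64 RGB image
feature_maps = conv(image)          # shape: (1, 16, 64, 64)
print(feature_maps.shape)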

[00121] Weight values in these weight matrices need to be obtained through a lot of training in actual application. Each weight matrix formed by using the weight values obtained through training may be used to extract information from an input image, to enable the convolutional neural network 200 to perform correct prediction.

[00122] When the convolutional neural network 200 has a plurality of convolutional layers, a relatively large quantity of general features are usually extracted at an initial convolutional layer (for example, 221). The general feature may also be referred to as a low-level feature and corresponds to a high-resolution feature map. As a depth of the convolutional neural network 200 increases, a feature extracted at a subsequent convolutional layer (for example, 226) becomes more complex, for example, a high-level semantic feature, and corresponds to a low- resolution feature map. A feature with higher semantics is more applicable to a to-be-resolved problem.

[00123] Pooling Layer:

[00124] A quantity of training parameters often needs to be reduced. Therefore, a pooling layer often needs to be periodically introduced after a convolutional layer. For the layers 221 to 226 shown in 220 in FIG. 2, one convolutional layer may be followed by one pooling layer, or a plurality of convolutional layers may be followed by one or more pooling layers. During image processing, the pooling layer is only used to reduce a space size of an image. The pooling layer may include an average pooling operator and/or a maximum pooling operator, and may be configured to sample an input image to obtain a smaller image, and may be further configured to sample a feature map input by the convolutional layer to obtain a smaller feature map. The average pooling operator may be used to calculate pixel values in an image in a specific range, to generate an average value. The average value is used as an average pooling result. The maximum pooling operator may be used to select a pixel with a maximum value in a specific range as a maximum pooling result. In addition, similar to that a size of a weight matrix at the convolutional layer needs to be related to a size of an image, an operator at the pooling layer also needs to be related to a size of an image. A size of a processed image output from the pooling layer may be less than a size of an image input into the pooling layer. Each pixel in the image output from the pooling layer represents an average value or a maximum value of a corresponding sub-region of the image input into the pooling layer.

[00125] Convolutional Neural Network Layer 230:

[00126] After processing is performed by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet ready to output the required output information. As described above, at the convolutional layer/pooling layer 220, only a feature is extracted, and parameters resulting from an input image are reduced. However, to generate final output information, the convolutional neural network 200 needs to generate a processed image result by using the convolutional neural network layer 230. Therefore, the convolutional neural network layer 230 may include a plurality of hidden layers (231, 232, ..., and 23n shown in FIG. 2) and an output layer 240. Parameters included in the plurality of hidden layers may be obtained through pre-training based on related training data of a specific task type. For example, the task type may include image semantic segmentation, image classification, and super-resolution image reconstruction. The hidden layer may perform a series of processing on a feature map output from the convolutional layer/pooling layer 220 to obtain the processed image result. A process of obtaining the processed image result based on the feature map output from the convolutional layer/pooling layer 220 will be subsequently described in detail, and details are not described herein.

[00127] At the convolutional neural network layer 230, the plurality of hidden layers are followed by the output layer 240, namely, the last layer of the entire convolutional neural network 200. The output layer 240 has a loss function similar to classification cross entropy, and the loss function is specifically used to calculate a prediction error. Once forward propagation (propagation in a direction from 210 to 240, as shown in FIG. 2) of the entire convolutional neural network 200 is completed, reverse propagation (propagation in a direction from 240 to 210, as shown in FIG. 2) is started to update a weight value and a deviation of each layer mentioned above, to reduce a loss of the convolutional neural network 200 and an error between a result (namely, the foregoing image processing result) output by the convolutional neural network 200 by using the output layer and an ideal result.

[00128] It should be noted that the convolutional neural network 200 shown in FIG. 2 is merely an example convolutional neural network. In a specific application, the convolutional neural network may alternatively exist in a form of another network model.

[00129] FIG. 3 shows a hardware structure of a chip according to an embodiment of this application. The chip includes a neural network processing unit 30. The chip may be disposed in the execution device 110 shown in FIG. 1 , to complete calculation work of the calculation module 111. The chip may alternatively be disposed in the training device 120 shown in FIG. 1, to complete training work of the training device 120 and output the target model/rule 101. All algorithms at the layers in the convolutional neural network shown in FIG. 2 may be implemented on the chip shown in FIG. 3.

[00130] The neural network processing unit NPU 30 is mounted to a host CPU as a coprocessor, and the host CPU allocates a task to the NPU 30. A core part of the NPU is an operation circuit 303, and a controller 304 controls the operation circuit 303 to extract data in a memory (a weight memory or an input memory) and perform an operation.

[00131] In some implementations, the operation circuit 303 includes a plurality of processing engines (PEs). In some implementations, the operation circuit 303 is a two-dimensional systolic array. The operation circuit 303 may alternatively be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 303 is a general-purpose matrix processor.

[00132] For example, it is assumed that there are an input matrix A, a weight matrix B, and an output matrix C. The operation circuit 303 fetches data corresponding to the matrix B from a weight memory 302, and buffers the data on each PE in the operation circuit 303. The operation circuit 303 fetches data of the matrix A from an input memory 301, to perform a matrix operation with the matrix B to obtain a partial result or a final result of a matrix, and stores the result in an accumulator 308.

[00133] A vector calculation unit 307 may perform further processing on the output of the operation circuit 303 such as vector multiplication, vector addition, exponential operation, logarithm operation, and size comparison. For example, the vector calculation unit 307 may be configured to perform network calculation such as pooling, batch normalization, or local response normalization, at a non-convolutional/non-FC layer in a neural network.

[00134] In some implementations, the vector calculation unit 307 can store a processed output vector in the unified memory 306. For example, the vector calculation unit 307 may apply a nonlinear function to the output of the operation circuit 303, for example, to a vector of an accumulated value, so as to generate an activation value. In some implementations, the vector calculation unit 307 generates a normalized value, a combined value, or both. In some implementations, the processed output vector can be used as the activation input for the operation circuit 303, for example, the processed output vector is used in a subsequent layer in the neural network.

[00135] The unified memory 306 is configured to store input data and output data.

[00136] A direct memory access controller (DMAC) 305 moves input data in an external memory to the input memory 301 and/or the unified memory 306, stores weight data in the external memory into the weight memory 302, and stores data in the unified memory 306 into the external memory.

[00137] A bus interface unit (BIU) 310 is configured to implement interaction between the host CPU, the DMAC, and an instruction fetch buffer 309 through a bus.

[00138] The instruction fetch buffer 309 connected to the controller 304 is configured to store instructions used by the controller 304.

[00139] The controller 304 is configured to invoke the instructions buffered in the instruction fetch buffer 309, to control a working process of the operation accelerator.

[00140] Generally, the unified memory 306, the input memory 301, the weight memory 302, and the instruction fetch buffer 309 are all on-chip memories. The external memory is a memory outside the NPU, and may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.

[00141] Operations of the layers in the convolutional neural network shown in FIG. 2 may be performed by the operation circuit 303 or the vector calculation unit 307.

[00142] The foregoing describes in detail basic content of a neural network and related apparatuses and models in the embodiments of this application with reference to FIG. 1 to FIG. 3. The following describes in detail the method in the embodiments of this application with reference to FIG. 4.

[00143] FIG. 4 is a schematic flowchart of a neural network training method 400 according to an embodiment of this application. The method may be performed by a device having a relatively strong operation capability such as a computer device, a server device or an operation device. The method 400 shown in FIG. 4 includes steps S410 to S440. The following separately describes the steps in detail.

[00144] S410, obtain multiple training image pairs, where each pair of the multiple training image pairs includes a high-quality image and a low-quality image.

[00145] S420, input the low-quality image into an enhancement network to generate an enhanced image; and input the high-quality image and at least two intermediate tensors into a degradation network to generate a degraded image, where the at least two intermediate tensors are determined based on the enhancement network.

[00146] S430, determine a first loss function based on the high-quality image and the enhanced image; and determine a second loss function based on the low-quality image and the degraded image.

[00147] S440, update the neural network based on the first loss function and the second loss function, where the neural network includes the enhancement network and the degradation network.
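A minimal sketch of one training iteration corresponding to steps S410 to S440 is given below. The module names (enhance_net, degrade_net), the choice of L1 losses, the optimizer, and the exact signature of the networks are assumptions for illustration, not the specific networks defined in this application.

```python
import torch
import torch.nn.functional as F

def training_step(enhance_net, degrade_net, optimizer, low_q, high_q):
    # S420: enhancement forward pass; in this sketch the enhancement network also
    # returns the intermediate tensors that are fed into the degradation network.
    enhanced, first_tensor, second_tensor = enhance_net(low_q)

    # S420: degradation forward pass using the high-quality image and the
    # at-least-two intermediate tensors determined by the enhancement network.
    degraded = degrade_net(high_q, first_tensor, second_tensor)

    # S430: first loss (enhancement) and second loss (degradation);
    # L1 is used here as one admissible choice of loss.
    loss_enh = F.l1_loss(enhanced, high_q)
    loss_deg = F.l1_loss(degraded, low_q)

    # S440: update the whole neural network (enhancement + degradation)
    # based on both losses.
    total = loss_enh + loss_deg
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```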

[00148] In this application, when the neural network is trained, not only the loss for the enhancement process but also the loss for the degradation process is calculated. The information about the degradation process, extracted from the input low-quality image through the image degradation process, can be further used in the enhancement network to improve the quality of the output enhanced image. Simultaneous learning of both image enhancement and image degradation is beneficial for better performing image enhancement. The methods in this application can be applied to tasks including but not limited to the single image super-resolution (SISR) task and the image deblurring task.

[00149] In step S410, the high-quality image and the low-quality image are images depicting the same scenery. In some implementations, they are both real images taken by cameras. In other implementations, they form artificial training datasets. In other words, the high-quality image can be a real image taken by a camera, while the low-quality image is an image degraded from the high-quality image by other neural networks (for example, by some kind of non-linear approach). The high-quality image and the low-quality image can have a same size. Optionally, they can have different sizes. The low-quality image may be a low-resolution image for the SISR task or a blurred image for the deblurring task.

[00150] The low-quality image may be degraded by the application of non-uniform spatially variant degradation kernels of different origins. Sources of degradation include (but are not limited to) the point spread functions (PSFs) of the optical system, complex scene depth maps, camera motion, and scene object motion. Thus, when these training image pairs are used to train the neural network, the trained neural network will be applicable to real-life images.

[00151] Also, the high-quality image and the low-quality image could each be a tensor of a same size, obtained by processing the original high-quality image and the original low-quality image. For example, the original high-quality image and the original low-quality image are processed by several filters to generate several feature maps or image tensors.

[00152] Different camera parameters can be used to acquire the high-quality and low-quality images. Some training image pairs may include moving objects, while some training image pairs may be acquired using a moving camera. Some training image pairs may be taken in dark lighting conditions. Some training image pairs may be landscape images including objects far away from the cameras. Some training image pairs may be taken by cameras with inferior optical systems, for example, by front cameras of smartphones.

[00153] Compared with images degraded by neural networks, using real low-quality images to train the neural network makes the trained network more applicable to real-life images, which involve degradation factors including but not limited to low resolution and blur.

[00154] In S420, for the training stage, the neural network includes an enhancement network and a degradation network. The enhancement network receives as the input the low-quality image, and the output of the enhancement network is the enhanced image. Also, the intermediate tensors determined by the enhancement network are used as inputs of the degradation network. The degradation network receives as the input the high-quality image and the at least two intermediate tensors, and the output of the degradation network is the degraded image.

[00155] In this application, the enhancement network includes a first encoder and a first decoder; the at least two intermediate tensors include a first intermediate tensor and a second intermediate tensor, where the first intermediate tensor is determined based on a bottleneck tensor in the first encoder and the second intermediate tensor is determined based on at least one of network layers in the first decoder.

[00156] The enhancement network includes an encoder and a decoder. For example, the enhancement network may follow the UNet design. Alternatively, the enhancement network may have other structures, such as RRDB (residual in residual dense block) or RCAN (very deep residual channel attention network), as long as it has an encoder and a decoder.

[00157] The bottleneck tensor is the output of the encoder of the UNet-like network. This tensor has the smallest spatial size (height × width) of all the tensors involved in the UNet-like network.

[00158] Optionally, for the SISR task, the enhancement network can include additional image upscaling operations in the beginning or in the end part of the image enhancement network.

[00159] The first intermediate tensor can be part of the bottleneck tensor in the first encoder. The first intermediate tensor may have the same height × width size as the bottleneck tensor yet include only part of the channels in the bottleneck tensor. The fraction of the bottleneck tensor extracted to be the first intermediate tensor may be 1/8; other fractions, for example, a half or a quarter, may also be used. In other words, a height × width size of the first intermediate tensor is the same as that of the bottleneck tensor, while a channel number of the first intermediate tensor is 1/8, 1/4, or 1/2 of a channel number of the bottleneck tensor.

[00160] The first intermediate tensor can be a processed result of the bottleneck tensor in the first encoder. For example, the bottleneck tensor is processed by several convolution kernels, followed by a downscaling method, to produce a global component tensor. Also, the global component tensor can be used as the input of the degradation network. In this implementation, a height × width size of the global component tensor used as the first intermediate tensor is 1 × 1, 2 × 2, or 3 × 3. In this case, the enhancement network includes a global component tensor estimation branch (to convert part of the bottleneck tensor into the global component tensor).

[00161] In an example in which 1/8 of the bottleneck tensor is extracted to produce the global component tensor, the global component tensor estimation branch is composed of a sequence of convolutions, each of which doubles the channel number of the tensor being processed while keeping its width and height the same. The number of such convolutions is chosen so that the output of the last convolution has the same channel number as the original bottleneck tensor (if 1/8 of the bottleneck tensor is extracted, the number is 3). Then the downscaling operation is applied to produce the global component tensor having a fixed spatial size of S_g × S_g; the value of S_g can be 2, and other values such as 3 can also be used. For the downscaling operation, any suitable downscaling method may be used, for example, bilinear, bicubic or area downscaling.
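The following is a hedged sketch of the global component tensor estimation branch described above, assuming that 1/8 of the bottleneck channels is extracted, three channel-doubling convolutions are used, S_g = 2, and area downscaling is applied; the concrete channel counts are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalComponentBranch(nn.Module):
    def __init__(self, bottleneck_channels=256, fraction=8, s_g=2):
        super().__init__()
        c = bottleneck_channels // fraction   # channels of the extracted 1/8 slice
        convs = []
        # Channel-doubling convolutions until the original channel count is restored
        # (three doublings for a 1/8 slice: c -> 2c -> 4c -> 8c).
        while c < bottleneck_channels:
            convs += [nn.Conv2d(c, 2 * c, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            c *= 2
        self.convs = nn.Sequential(*convs)
        self.s_g = s_g
        self.fraction = fraction

    def forward(self, bottleneck):
        # Extract a fraction of the bottleneck channels as the first intermediate tensor.
        part = bottleneck[:, : bottleneck.shape[1] // self.fraction]
        x = self.convs(part)
        # Area downscaling to the fixed S_g x S_g spatial size.
        return F.adaptive_avg_pool2d(x, self.s_g)

# Usage: a 256-channel bottleneck of spatial size 8x8 -> global tensor of shape 256 x 2 x 2
branch = GlobalComponentBranch()
print(branch(torch.randn(1, 256, 8, 8)).shape)
```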

[00162] Then the global component tensor can be further processed and injected into the enhancement network’s first decoder. In other words, the method further includes: injecting a processed result of the first intermediate tensor into at least one network layer in the first decoder, where the processed result of the first intermediate tensor has a spatial size of 1 × 1. Also, the first intermediate tensor can be injected back to the bottleneck tensor.

[00163] In the above implementation, the global component tensor will be processed and injected into at least one network layer in the first decoder. The injection method could be, but is not limited to, concatenation or affine injection.

[00164] If the injection method is concatenation, the following process could be included: firstly, the global component tensor is compressed by the global average pooling operation to be the compressed global component tensor, which has a spatial size of 1 × 1 (the pooling operation does not involve a convolution kernel); then, for each network layer of the decoder, the compressed global component tensor is processed by at least two fully-connected layers to produce a tensor having the same spatial size of 1 × 1 and a number of channels equal to that of the decoder’s tensor at this network layer; and finally, this tensor is replicated along the spatial dimensions to match the spatial size of the decoder’s tensor, so that the replicated tensor and the decoder’s tensor can finally be concatenated along the channel axis.
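A sketch of the concatenation injection for one decoder layer follows; the two fully-connected layers and the channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatInjection(nn.Module):
    """Injects the global component tensor into one decoder layer by concatenation."""
    def __init__(self, global_channels=256, decoder_channels=64):
        super().__init__()
        # At least two fully-connected layers map the 1x1 compressed global tensor
        # to a vector with the decoder layer's channel count.
        self.fc = nn.Sequential(
            nn.Linear(global_channels, decoder_channels),
            nn.ReLU(inplace=True),
            nn.Linear(decoder_channels, decoder_channels),
        )

    def forward(self, global_tensor, decoder_tensor):
        # Global average pooling compresses the global tensor to a spatial size of 1x1.
        pooled = F.adaptive_avg_pool2d(global_tensor, 1).flatten(1)   # (N, C_g)
        vec = self.fc(pooled)                                         # (N, C_dec)
        n, c = vec.shape
        h, w = decoder_tensor.shape[-2:]
        # Replicate along the spatial dimensions to match the decoder tensor.
        replicated = vec.view(n, c, 1, 1).expand(n, c, h, w)
        # Concatenate along the channel axis.
        return torch.cat([decoder_tensor, replicated], dim=1)

inj = ConcatInjection()
out = inj(torch.randn(1, 256, 2, 2), torch.randn(1, 64, 32, 32))
print(out.shape)   # (1, 128, 32, 32)
```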

[00165] If the injection is affine injection, the global component tensor is also compressed to have a spatial size of 1 × 1 by the pooling operation first; then the compressed global component tensor is processed by a number of fully-connected layers to produce the scaling tensor and the bias tensor, which are used to perform an affine transform of the decoder’s tensor by multiplying it with the scaling tensor and summing with the bias tensor.
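Correspondingly, a sketch of the affine injection variant, under the same illustrative assumptions about layer sizes, might look like this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineInjection(nn.Module):
    """Affine injection: scale and shift a decoder tensor using the global tensor."""
    def __init__(self, global_channels=256, decoder_channels=64):
        super().__init__()
        self.fc_scale = nn.Sequential(nn.Linear(global_channels, decoder_channels),
                                      nn.ReLU(inplace=True),
                                      nn.Linear(decoder_channels, decoder_channels))
        self.fc_bias = nn.Sequential(nn.Linear(global_channels, decoder_channels),
                                     nn.ReLU(inplace=True),
                                     nn.Linear(decoder_channels, decoder_channels))

    def forward(self, global_tensor, decoder_tensor):
        pooled = F.adaptive_avg_pool2d(global_tensor, 1).flatten(1)   # compress to 1x1
        scale = self.fc_scale(pooled)[..., None, None]                # scaling tensor
        bias = self.fc_bias(pooled)[..., None, None]                  # bias tensor
        # Affine transform of the decoder's tensor.
        return decoder_tensor * scale + bias

print(AffineInjection()(torch.randn(1, 256, 2, 2), torch.randn(1, 64, 32, 32)).shape)
```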

[00166] The injection of the global component tensor, which carries the global information of the decoupled degradation, will improve the enhancement network performance during the inference stage without requiring too much computation. Also, the per-pixel kernel map (discussed later) determined by the first intermediate tensor and the second intermediate tensor is not injected into the enhancement network, to avoid an enormous computational penalty.

[00167] It is to be noted that, for the case where part of the bottleneck tensor in the encoder is extracted and used directly as the first intermediate tensor input into the degradation network, the global component tensor is still acquired in the degradation network, yet the global component tensor is not injected into the enhancement network. So, the global component tensor estimation branch is included in the degradation network as part of the pointwise convolution weights estimation subnetwork rather than included in the enhancement network. A pointwise convolution is a convolution with a kernel size of 1 × 1.

[00168] In this application, the degradation network includes a first subnetwork, the inputting the high-quality image and at least two intermediate tensors into a degradation network to generate a degraded image includes: inputting the first intermediate tensor and the second intermediate tensor into the first subnetwork to generate a per-pixel kernel map, where convolution weights of multiple network layers of the first subnetwork are determined based on the first intermediate tensor; and inputting the high-quality image into the degradation network; convolving the high-quality image with per-pixel kernels to generate the degraded image, where the per-pixel kernels are determined from the per-pixel kernel map.

[00169] The first subnetwork is intended to generate the per-pixel kernel map, which is then processed to be per-pixel kernels to convolve with an input high-quality image to generate the degraded image. Also, in the first subnetwork, the first intermediate tensor determines the weights in the first subnetwork to convert the second intermediate tensor into the per-pixel kernel map. In other words, the first subnetwork takes as the input the second intermediate tensor (the second intermediate tensor is the local component tensor or the second intermediate tensor can be processed to get the local component tensor) and whose parameters (convolution weights) are estimated from the first intermediate tensor.

[00170] Instead of assuming a single degradation kernel per image, the present application estimates a per-pixel kernel map for the degradation process. Most practical image degradation scenarios such as optical system’s PSFs and motion blur imply non-uniform spatially variant degradation kernels. So, by estimating the per-pixel kernel map, the degradation network of the method better explores the underlying degradation process and thereby provides more informative clues for the enhancement network.

[00171] In this application, the first subnetwork includes a pointwise convolution weights estimation subnetwork and a kernel map estimation subnetwork, the inputting the first intermediate tensor and the second intermediate tensor into the first subnetwork to generate a per-pixel kernel map includes: inputting the first intermediate tensor into the pointwise convolution weights estimation subnetwork to generate multiple dynamic pointwise convolution weights; inputting the second intermediate tensor into the first subnetwork and determining a local component tensor based on the second intermediate tensor; and inputting the local component tensor and the multiple dynamic pointwise convolution weights into the kernel map estimation subnetwork to generate the per-pixel kernel map, where convolution weights of multiple network layers of the kernel map estimation subnetwork are the multiple dynamic pointwise convolution weights.

[00172] The first intermediate tensor which is determined by the bottleneck tensor can indicate global information of the degradation information. And the second intermediate tensor determines the local component tensor which indicates local information about the degradation process. Instead of directly estimating the per-pixel kernel map, the present technical solution first decouples the degradation into a tensor about global information and local information. The tensor about global information is high-dimensional (for example, the first intermediate tensor that has a same dimension as the bottleneck tensor) but common for the entire image and the tensor about local information is low-dimensional (for example, the local component tensor with a dimension of 8, 4 ...) but specific for each pixel. The method then combines both global and local components for estimation of the per-pixel kernel map. Such decoupling dramatically decreases the internal dimensionality of the spatially variant degradation estimation task, thereby greatly mitigating its ill-posedness and making the per-pixel kernel map more consistent.

[00173] In other words, the first subnetwork includes two subnetworks: (1) a pointwise convolution weights estimation subnetwork, which takes as the input the first intermediate tensor and whose output is multiple dynamic pointwise convolution weights; and (2) a kernel map estimation subnetwork, which takes as the input a local component tensor and whose output is the per-pixel kernel map.

[00174] The local component tensor is a tensor whose height × width size is the same as that of the high-quality image and whose channel number is 4, 8, 16, 24, or 32.

[00175] The local component tensor has the same spatial size as the high-quality image. After several pointwise convolution layers, it remains the same size, and can be divided into m (height × width) n-dimensional vectors, where each of the m vectors corresponds to one pixel in the high-quality image.

[00176] In some implementations, the local component tensor is the second intermediate tensor. The second intermediate tensor is determined based on at least one network layer in the first decoder. To be specific, the processed result of the at least one network layer in the first decoder is used as the second intermediate tensor to be input for the first subnetwork (kernel map estimation subnetwork).

[00177] In this case, the enhancement network further includes a kernel embedding map estimation subnetwork, the local component tensor is the second intermediate tensor, and the method further includes: inputting at least one of network layer tensors in the first decoder into the kernel embedding map estimation subnetwork to generate the second intermediate tensor; and concatenating the second intermediate tensor with the one of network layers in the first decoder.

[00178] The second intermediate tensor (local information about the degradation process or local component tensor) is injected into the enhancement network to improve the performance of it. As the local component tensor is low dimensional, the injection will not incur too much computational cost.

[00179] The kernel embedding map estimation subnetwork is structured similarly to the first decoder and includes a second decoder. Several network layer tensors in the first decoder are concatenated with several network layer tensors of the kernel embedding map estimation subnetwork.

[00180] In other implementations, the tensor of the at least one network layer in the first decoder is directly used as the second intermediate tensor to be input for the first subnetwork.

[00181] In this case, the first subnetwork further includes a kernel embedding map estimation subnetwork, which means the first subnetwork altogether includes three subnetworks: a pointwise convolution weights estimation subnetwork, a kernel map estimation subnetwork, and a kernel embedding map estimation subnetwork. Then the inputting the second intermediate tensor into the first subnetwork and determining a local component tensor based on the second intermediate tensor includes: inputting the second intermediate tensor into the kernel embedding map estimation subnetwork to generate the local component tensor.

[00182] Also, the kernel embedding map estimation subnetwork includes a second decoder. Several network layer tensors in the first decoder are concatenated with several network layer tensors of the kernel embedding map estimation subnetwork. The output of the kernel embedding map estimation subnetwork is also the local component tensor, but it is directly used as the input of the kernel map estimation subnetwork and is not injected into the enhancement network again.

[00183] It is to be noted that the difference between the two cases (the kernel embedding map estimation subnetwork belongs to the enhancement network, or the kernel embedding map estimation subnetwork belongs to the degradation network) lies in whether the output of the kernel embedding map estimation subnetwork (the local component tensor) is injected back into the enhancement network via concatenation. Also, in cases where the local component tensor is injected back into the enhancement network, the kernel embedding map estimation subnetwork can be simplified to reduce the computational cost.

[00184] The foregoing explains the first intermediate tensor and the second intermediate tensor and the main structures of the enhancement network and the degradation network.

[00185] In S430, for the enhancement network, the first loss function is determined and for the degradation network, the second loss function is determined.

[00186] During the training stage, the enhanced image and the high-quality image are used to compose the first loss function, which may be a combination of the mean absolute error (MAE or L1) loss, structural similarity index measure (SSIM) loss, mean squared error (MSE) loss, perceptual loss, adversarial loss, or any other losses used for SISR and image deblurring tasks. The first loss function quantifies the difference between the enhanced image and the high-quality image, and is minimized by the training procedure to make the image enhancement network output an enhanced image that is close to the high-quality image used as a reference.

[00187] During the training stage, the degraded image and the low-quality image are used to compose the degradation loss, denoted as the second loss function, which may be a combination of L1, SSIM, MSE, perceptual loss, adversarial loss, or any other losses used for SISR and image deblurring tasks. Experiments found that the L1 difference is sufficient for the degradation loss. The degradation loss quantifies the difference between the degraded image and the low-quality image, and is minimized by the training procedure to make the image degradation network output a degraded image close to the low-quality image, thereby indirectly promoting extraction of an informative global component tensor and local component tensor containing important characteristics of the underlying degradation process specific to the input low-quality image.

[00188] The first loss function is taken as an example. For every training image pair, the MSE loss is the square of the difference between the pixel values of the enhanced image and the pixel values of the high-quality image. For the multiple training image pairs, the MSE loss can be the average value of the MSE losses of the multiple training image pairs.

[00189] In S440, the neural network is updated using the first loss function and the second loss function.

[00190] Next, the structure of the enhancement network and the degradation network will be detailed for explaining the updating process of the S440.

[00191] As presented earlier, the main body of the enhancement network (the first encoder and the first decoder) and the kernel embedding map estimation subnetwork follow the same network design and both include an encoder and a decoder. Also, for the enhancement network, skip connections are provided between the first encoder and the first decoder, which will allow better training results and avoid problems of vanishing gradients or exploding gradients.

[00192] The first encoder, the first decoder, and the kernel embedding map estimation subnetwork all include multiple convolution layers. For each convolution layer, the convolution kernel needs to be trained. Also, weights involved in the convolution layers of the global component tensor estimation branch and the fully-connected layers of the global component tensor injection process need to be trained.

[00193] The pointwise convolution weights estimation subnetwork takes as the input the first intermediate tensor, and the output is the multiple dynamic pointwise convolution weights. In the case where the first intermediate tensor is part of the bottleneck tensor in the first encoder, the first intermediate tensor is first processed by several convolution layers and a downscaling layer to be the global component tensor, and then the global component tensor is processed to be the multiple dynamic pointwise convolution weights. In the case where the first intermediate tensor is the global component tensor, the global component tensor is directly processed to be the multiple dynamic pointwise convolution weights. As the details of how the first intermediate tensor (part of the bottleneck tensor in the first encoder) is processed into the global component tensor by the global component tensor estimation branch are described above (the weights involved in its convolution layers need to be trained), the following presents how the global component tensor is transformed into the multiple dynamic pointwise convolution weights.

[00194] The global component tensor is first flattened, that is, reshaped from a height × width × channel size of S_g × S_g × C_g to 1 × 1 × (C_g · S_g²). The flattened global component tensor is subsequently used for estimation of the multiple dynamic pointwise convolution weights using at least two fully-connected layers. The multiple dynamic pointwise convolution weights are subsequently used in several network layers of the kernel map estimation subnetwork, so the number of the multiple dynamic pointwise convolution weights is determined by the number of the layers using the dynamic pointwise convolution weights. The number may be 5 in this application; other numbers such as 3, 4, 6, 7, or 8 can also be used. In the training procedure of the network, the weights of the at least two fully-connected layers need to be trained.
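A hedged sketch of the pointwise convolution weights estimation is given below: the S_g × S_g × C_g global component tensor is flattened and processed by at least two fully-connected layers to produce one weight set per dynamic pointwise convolution. All sizes (C_g = 256, S_g = 2, C_e = 16, five dynamic layers, hidden width) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PointwiseWeightsEstimator(nn.Module):
    def __init__(self, c_g=256, s_g=2, c_e=16, num_dynamic=5, hidden=512):
        super().__init__()
        self.c_e = c_e
        self.num_dynamic = num_dynamic
        in_features = c_g * s_g * s_g              # flattened global component tensor
        out_features = num_dynamic * c_e * c_e     # one C_e x C_e weight matrix per dynamic layer
        self.fc = nn.Sequential(nn.Linear(in_features, hidden),
                                nn.ReLU(inplace=True),
                                nn.Linear(hidden, out_features))

    def forward(self, global_tensor):
        # Flatten S_g x S_g x C_g to a single vector per image.
        flat = global_tensor.flatten(1)
        weights = self.fc(flat)
        # Reshape into num_dynamic pointwise (1x1) convolution weight matrices.
        return weights.view(-1, self.num_dynamic, self.c_e, self.c_e)

est = PointwiseWeightsEstimator()
print(est(torch.randn(1, 256, 2, 2)).shape)   # (1, 5, 16, 16)
```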

[00195] The kernel map estimation subnetwork takes as the input the local component tensor, and the output is the per-pixel kernel map. The kernel map estimation subnetwork is structured as a multi-layer perceptron (MLP) taking as the input the C_e-dimensional local component tensor (of spatial size H × W, the same as the height × width of the high-quality image) and generates a K²-dimensional vector. The process of converting the per-pixel kernel map into per-pixel kernels is as follows: the K²-dimensional vector from the per-pixel kernel map is reshaped into a K × K degradation kernel impulse response and normalized to sum to unity, to get H × W per-pixel kernels for the H × W pixels in the high-quality image, where K is the degradation kernel width/height and can be an odd value such as 25, though values such as 21, 23, 27, or 29 can also be used. In practice, the MLP is implemented as a sequence of pointwise convolutions operating on tensors of the same spatial size as the high-quality image and having the channel number C_e. The final normalization is done by dividing each K × K kernel by the sum of its elements.
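The conversion of the local component tensor into per-pixel kernels can be sketched as follows. Because pointwise (1 × 1) convolutions act on each pixel independently, this sketch applies the MLP to a (H·W, C_e) view of the tensor; two static layers, five dynamic layers, C_e = 16, and K = 25 are assumed for illustration, and the helper names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def estimate_per_pixel_kernels(local_tensor, dynamic_weights, static_layers, final_proj, K=25):
    """
    local_tensor:    (N, C_e, H, W) local component tensor
    dynamic_weights: (N, L, C_e, C_e) per-image weights for L dynamic pointwise convolutions
    static_layers:   list of nn.Linear(C_e, C_e) acting as static pointwise convolutions
    final_proj:      nn.Linear(C_e, K*K) static projection to the K^2-dimensional vector
    """
    n, c, h, w = local_tensor.shape
    x = local_tensor.permute(0, 2, 3, 1).reshape(n, h * w, c)
    for layer in static_layers:                    # static pointwise convolutions
        x = F.relu(layer(x))
    for i in range(dynamic_weights.shape[1]):      # dynamic pointwise convolutions
        x = F.relu(torch.bmm(x, dynamic_weights[:, i]))
    x = final_proj(x)                              # (N, H*W, K*K) per-pixel kernel map
    kernels = x.view(n, h * w, K, K)
    # Normalize each K x K degradation kernel impulse response to sum to unity.
    kernels = kernels / kernels.sum(dim=(2, 3), keepdim=True)
    return kernels                                 # (N, H*W, K, K) per-pixel kernels

# Illustrative usage with C_e = 16, two static and five dynamic layers, K = 25
c_e, K = 16, 25
static = [nn.Linear(c_e, c_e) for _ in range(2)]
final = nn.Linear(c_e, K * K)
local = torch.randn(1, c_e, 32, 32)
dyn_w = torch.randn(1, 5, c_e, c_e)
print(estimate_per_pixel_kernels(local, dyn_w, static, final, K).shape)
```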

[00196] To make the conversion of the local component tensor to the per-pixel kernel map adaptive to the input image content, only a few first pointwise convolutions of the kernel map estimation subnetwork are static convolutions, while the rest are dynamic convolutions, whose weights are estimated by the pointwise convolution weights estimation subnetwork. A static convolution is a conventional convolution whose weights are independent of the input data. Those weights are updated during the training stage according to the conventional training procedure, but are fixed at the inference stage. A dynamic convolution is a convolution whose weights (the dynamic pointwise convolution weights in this application) depend on the input data, typically by being estimated from the input data by a dedicated part of the network.

[00197] The parameter count of a pointwise convolution in the kernel map estimation subnetwork is determined by the input and output channel numbers of the network layer; for example, for a convolution layer whose input channel number and output channel number are both C_e, the parameter count is C_e × C_e. The numbers of static and dynamic convolutions can be 2 and 5, respectively, though other values may also be used. For example, the number of static convolutions can be 1, 3, 4, 5, 6, or the like, and the number of dynamic convolutions can be 2, 3, 4, 6, 7, or the like. The static convolution weights are updated several times during the training process. Only the weights of the static convolutions need to be trained.

[00198] Instead of restricting the per-pixel kernels (determined from the per-pixel kernel map) to belong to some simple low-parametric kernel family such as anisotropic Gaussian kernels, the present technical solution derives the degradation kernel for each pixel (per-pixel kernel) as the output of a very general non-linear transform operating on kernel latent space vectors of configurable dimension. The method uses MLP as a non-linear transform from the kernel latent space vector to kernel impulse response. In this manner, the method is able to estimate the per-pixel kernel map containing kernels from a rich kernel space provided by the MLP, which is well-known for its universal approximation capabilities.

[00199] Also, instead of employing a static MLP, whose weights are the same for all input images, the present technical solution uses MLP with dynamic layers, weights of which are estimated by another MLP from the global component tensor of the decoupled degradation. This approach allows the method to better adapt to each input image and thereby to better understand its underlying degradation.

[00200] Then the high-quality image is convolved with the per-pixel kernels determined from the per-pixel kernel map to generate the degraded image.
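A sketch of the spatially variant convolution of the high-quality image with the per-pixel kernels, implemented here with unfolded image patches, is given below; the shapes follow the illustrative values used in the earlier sketches and are assumptions rather than the exact implementation of this application.

```python
import torch
import torch.nn.functional as F

def per_pixel_convolution(high_q, kernels, K=25):
    """
    high_q:  (N, C, H, W) high-quality image
    kernels: (N, H*W, K, K) per-pixel degradation kernels, one per pixel
    Returns the degraded image of shape (N, C, H, W).
    """
    n, c, h, w = high_q.shape
    pad = K // 2
    # Extract a K x K neighborhood around every pixel: (N, C*K*K, H*W)
    patches = F.unfold(high_q, kernel_size=K, padding=pad)
    patches = patches.view(n, c, K * K, h * w)
    # Each pixel's neighborhood is weighted by that pixel's own kernel.
    weights = kernels.view(n, h * w, K * K).permute(0, 2, 1).unsqueeze(1)   # (N, 1, K*K, H*W)
    degraded = (patches * weights).sum(dim=2)                               # (N, C, H*W)
    return degraded.view(n, c, h, w)

# Illustrative usage matching the kernel-estimation sketch above
out = per_pixel_convolution(torch.randn(1, 3, 32, 32), torch.randn(1, 32 * 32, 25, 25))
print(out.shape)   # (1, 3, 32, 32)
```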

[00201] It is to be noted that convolutions and fully-connected layers are typically followed by a non-linear activation, for example, a rectified linear unit (ReLU) or leaky ReLU. Exceptions include the last convolutions, which produce the output enhanced image, the per-pixel kernel map, and the local component tensor, as well as the fully-connected layers which produce the pointwise convolution weights.

[00202] To sum up, weights needed to be trained include: weights in the first encoder and the first decoder, weights of the fully-connected layers in the global component tensor estimation branch and the pointwise convolution weights estimation subnetwork, weights of the kernel embedding map estimation subnetwork, and weights of the static convolutions in the kernel map estimation subnetwork.

[00203] Before the training of the neural network, the above weights are initialized. Then, after the first training iteration, the first loss function and the second loss function are calculated, and if the total loss function does not meet the standard, the weights of the neural network are updated and the second training iteration starts, and so on. The training is finished when the total loss function meets the standard.

[00204] The total loss used by the training procedure for training the whole network is the sum of the enhancement and degradation losses, so the training procedure simultaneously minimizes both losses. The criteria can include that the sum of the first loss function and the second loss function is smaller than a preset threshold.
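
A minimal sketch of this criterion follows; the L1 form of the two losses and the threshold value are illustrative assumptions, and only the summation of the enhancement and degradation losses is taken from the description above.

```python
import torch
import torch.nn.functional as F

def total_loss(enhanced, high_quality, degraded, low_quality):
    first_loss = F.l1_loss(enhanced, high_quality)    # enhancement loss (assumed L1)
    second_loss = F.l1_loss(degraded, low_quality)    # degradation loss (assumed L1)
    return first_loss + second_loss                   # both losses are minimized jointly

threshold = 1e-3                                      # assumed preset threshold
loss = total_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
                  torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
if loss.item() < threshold:
    print("criteria met, training finished")
```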

[00205] Once the training is over, only the enhancement network is used at the inference stage. So, the properly trained image enhancement network should be integrated into the imaging device firmware, for example, mobile phone camera software. The image degradation network is only used during the training stage and is aimed at helping the encoder of the image enhancement network in learning efficient estimation of the informative global component tensor and local component tensor of the degradation process.

[00206] It is to be noted that the present technical solution can be applied to a wide variety of CNN-based image enhancement backbone networks to improve their quality. The only requirement for the applicability of the present application is that the backbone architecture should be dividable into the “encoder” and “decoder” parts. For example, the present application may be used to improve output image quality for many lightweight CNN-based image enhancement networks without significant deterioration of their computational performance, thereby providing high flexibility to target hardware and software requirements.

[00207] It is also to be noted that the process of the per-pixel kernels convolving with the high-quality image to generate the degraded image can be replaced with, for example, some kind of non-linear image filtering.

[00208] It is also to be noted that the kernel map estimation subnetwork using an MLP could be replaced with some other transform. However, the solution implemented in this application provides a good trade-off between the richness of the space of modeled degradation kernels and the cumbersomeness of transform parameter estimation.

[00209] It is also to be noted that the pointwise convolution weights estimation subnetwork could be modified.

[00210] FIG. 5 is a schematic flowchart of an image processing method 500 according to an embodiment of this application. The method may be performed by a device such as a hand-held device like a smartphone. The method 500 shown in FIG. 5 includes steps S510 to S520. The following separately describes the steps in detail.

[00211] S510, obtain a to-be-processed image;

[00212] S520, process the to-be-processed image with an enhancement network, where the enhancement network is obtained by training a neural network comprising the enhancement network and a degradation network based on multiple training image pairs, where each pair of the multiple training image pairs includes a high-quality image and a low-quality image. The neural network is trained based on a first loss function and a second loss function, where the first loss function is determined based on the high-quality image and an enhanced image, and the second loss function is determined based on the low-quality image and a degraded image. The enhanced image is generated by the enhancement network with an input of the low-quality image, and the degraded image is generated by the degradation network with inputs of the high-quality image and at least two intermediate tensors, where the at least two intermediate tensors are determined based on the enhancement network.

[00213] In S510, the to-be-processed image is degraded by the application of non-uniform spatially variant degradation kernels of different origins. Sources of degradation include (but are not limited to) optical system’s PSFs, complex scene depth maps, camera motion, and scene object motion.

[00214] For example, the to-be-processed image is a selfie photo taken by the smartphone’s front camera. Typically, the front camera has an inferior optical system, which results in degraded image details on photos taken by this camera.

[00215] For another example, the to-be-processed image is a landscape photo taken by the smartphone’s main camera. Landscape photos often include objects located far from the camera. Such objects lack fine-grained details due to blurring by the main camera’s PSF.

[00216] For another example, the to-be-processed image is a photo taken in low light conditions. Typically, for such conditions photos are taken using long exposure, which results in image blurring by motion blur due to camera motion or moving scene objects.

[00217] In S520, the to-be-processed image is processed with the enhancement network. The enhancement network used in method 500 is extracted from the neural network trained by methods provided in method 400.
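
The following sketch illustrates steps S510 and S520 at the inference stage; the tiny stand-in network and the image size are hypothetical, since in practice the trained enhancement network extracted from method 400 would be loaded from the device firmware.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the trained enhancement network extracted after training.
enhancement_net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                nn.Conv2d(16, 3, 3, padding=1))
enhancement_net.eval()

image = torch.rand(1, 3, 256, 256)        # S510: obtain the to-be-processed image
with torch.no_grad():
    enhanced = enhancement_net(image)     # S520: process it with the enhancement network
print(enhanced.shape)
```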

[00218] For details of method 500, refer to the descriptions of method 400, which are not repeated here to avoid repetition.

[00219] FIG. 6 is an example neural network, which may be the neural network used for training in method 400. Next, we will use the neural network provided in FIG. 6 to further present the process of method 400.

[00220] As shown in FIG. 6, the neural network includes an enhancement network and a degradation network. The enhancement network receives as the input the low-quality image 10 to generate the enhanced image 19. The degradation network receives as the input the high-quality image 88 and the at least two intermediate tensors determined by the enhancement network: tensor 22, tensor 14, tensor 16, and tensor 18, to generate the degraded image 87, where tensor 22 is the first intermediate tensor, and the second intermediate tensor includes tensor 14, tensor 16, and tensor 18.

[00221] In the enhancement network, the first encoder includes tensor 10, tensor 11, tensor 12, and tensor 13, where tensor 131 is the bottleneck tensor in the first encoder. The first decoder includes tensor 14, tensor 15, tensor 16, tensor 17, tensor 18, and tensor 19. The first decoder combines the extracted features using skip connections from corresponding layers in the first encoder. Skip connections (S29, S30) from tensor 11 and tensor 12 propagate low level image details to higher levels (tensor 15 and tensor 17) for improved local component tensor estimation during training.

[00222] In the first encoder, the input low-quality image 10 is processed by several convolution operations to become the bottleneck tensor 131. FIG. 6 shows three convolution operations: S10, S11, and S12.

[00223] The enhancement network also includes a global component tensor estimation branch, which takes as input part of the bottleneck tensor 131 (for example, 1/8 or 1/4 or 1/2 of the bottleneck tensor 131) and outputs the global component tensor 22, which is used as the first intermediate tensor input to the degradation network.

[00224] The global component tensor estimation branch includes a sequence of convolution operations (S19 and S20), each of which doubles the channel number of the tensor being processed while keeping its height and width the same, so that the output of the last convolution operation (tensor 21) has the same channel number as the bottleneck tensor 131; this means the number of convolution operations is 3, 2, or 1 for taking 1/8, 1/4, or 1/2 of the bottleneck tensor, respectively. FIG. 6 only depicts two convolution operations for simplicity. Then the downscaling operation (S21) is applied to tensor 21 to produce the global component tensor 22 with a fixed spatial size of S_g x S_g, where S_g is the height/width of the global component tensor. The value of S_g can be 2, 1, or 3.
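
A hedged sketch of this branch follows, with assumed bottleneck sizes: two channel-doubling convolutions (standing in for S19 and S20) restore the bottleneck channel number from a 1/4 part of the bottleneck tensor, and an average-pooling downscale (standing in for S21) produces the S_g x S_g global component tensor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C_b, H_b, W_b = 64, 32, 32                   # assumed bottleneck tensor 131 shape
S_g = 2                                      # fixed spatial size of global component tensor

part = torch.randn(1, C_b // 4, H_b, W_b)    # 1/4 part of the bottleneck tensor 131

branch = nn.Sequential(                      # channel-doubling convolutions (S19, S20)
    nn.Conv2d(C_b // 4, C_b // 2, 3, padding=1), nn.ReLU(),
    nn.Conv2d(C_b // 2, C_b, 3, padding=1), nn.ReLU(),
)
tensor_21 = branch(part)                     # same channel number as the bottleneck
tensor_22 = F.adaptive_avg_pool2d(tensor_21, S_g)   # S21: downscale to S_g x S_g
print(tensor_22.shape)                       # torch.Size([1, 64, 2, 2])
```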

[00225] The global component tensor 22 is further injected into the first decoder via concatenation. In FIG. 6, the global component tensor is concatenated with tensor 151 and tensor 152 to form tensor 15, and with tensor 171 and tensor 172 to form tensor 17. Also, the global component tensor is concatenated with the bottleneck tensor 131 to form tensor 13. Here, the global component tensor needs to be processed before being concatenated with the corresponding tensor.

[00226] In an example in which the global component tensor is processed for concatenation with the bottleneck tensor 131, the global component tensor 22 is first compressed by the global average pooling operation S22 to produce the compressed global component tensor 23, which is then followed by at least two fully-connected operations S23 and S24 to produce a tensor having the spatial size 1 x 1 and the same channel number as the bottleneck tensor 131. The tensor is replicated along the spatial dimensions to match the size of the bottleneck tensor 131. The bottleneck tensor 131 and the injected tensor 132 are concatenated to form tensor 13, serving as the input of the convolution operation S13.
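
A sketch of this injection under assumed channel counts: global average pooling, two fully-connected layers, spatial replication to the bottleneck size, and concatenation along the channel dimension.

```python
import torch
import torch.nn as nn

C_g, S_g = 64, 2                             # assumed global component tensor 22 shape
C_b, H_b, W_b = 64, 32, 32                   # assumed bottleneck tensor 131 shape

g22 = torch.randn(1, C_g, S_g, S_g)          # global component tensor 22
bottleneck_131 = torch.randn(1, C_b, H_b, W_b)

g23 = g22.mean(dim=(2, 3))                   # S22: global average pooling -> (1, C_g)
fc = nn.Sequential(nn.Linear(C_g, C_g), nn.ReLU(), nn.Linear(C_g, C_b))   # S23, S24
injected_132 = fc(g23).view(1, C_b, 1, 1).expand(-1, -1, H_b, W_b)  # replicate spatially
tensor_13 = torch.cat([bottleneck_131, injected_132], dim=1)        # input of S13
print(tensor_13.shape)                       # torch.Size([1, 128, 32, 32])
```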

[00227] In the first decoder, tensor 13 is also processed by several convolution operations (S13, S14, S15, S16, S17, and S18) to produce the enhanced image 19. Note that for the operation S15, the input tensor includes three parts: the output result of S14 (tensor 151), the feature from tensor 12 via the skip connection (tensor 152), and the injected global component tensor 153. For the convolution operation S17, the input tensor is similar to that of S15, which is not described here again to avoid repetition.

[00228] Note that convolution operations and fully-connected operations are typically followed by non-linear activation (for example, ReLU or leaky ReLU), which is not depicted in FIG. 6 for simplicity.

[00229] The degradation network includes a first subnetwork which takes as the input the second intermediate tensor (tensor 14, tensor 16 and tensor 18) and the first intermediate tensor 22, then outputs the per-pixel kernel map 85, where the first intermediate tensor 22 is used to determine weights of some convolution operations (S82 and S83) in the first subnetwork. The first subnetwork includes layer tensors: tensor 41 to tensor 45, tensor 80 to tensor 88 as shown in FIG. 6. The first subnetwork further includes three subnetworks: the pointwise convolution weights estimation subnetwork, the kernel map estimation subnetwork, and the kernel embedding map estimation subnetwork.

[00230] The kernel embedding map estimation subnetwork takes as the input the second intermediate tensor and outputs the local component tensor 80 via several convolution operations including S41 to S45. The kernel embedding map estimation subnetwork follows a structure similar to that of the enhancement network. For convolution operation S42, the input includes the output result of convolution operation S41 (tensor 421) and a copy of the second intermediate tensor 16 (tensor 422). For convolution operation S44, the situation is similar, which is not detailed here for simplicity.

[00231] The kernel map estimation subnetwork takes as the input the local component tensor 80 and outputs the per-pixel kernel map 85. The local component tensor 80 is first processed by several static pointwise convolution operations (S80 and S81) with a kernel size of 1 x 1, followed by several dynamic pointwise convolution operations (S82 and S83), whose weights are determined by the pointwise convolution weights estimation subnetwork (discussed later). Tensor 80 to tensor 84 can have the same height x width size as the high-quality image 88, and the channel numbers of tensor 80 to tensor 84 are the same, for example, the channel number is 8 (other channel numbers such as 4, 16, 24, 32, or the like may also be used). The output of the convolution operation S83 is tensor 85, with the same height x width size and a channel number of K^2, where K can be an odd number such as 25, 21, 23, 19, 17, or the like. Tensor 85 is then converted into the per-pixel kernels (tensor 86) by being reshaped into K x K per-pixel kernels (the number of per-pixel kernels is height x width, corresponding to the pixel count of the high-quality image 88) and normalized to sum to unity, producing the per-pixel kernels 86.
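
A minimal sketch of the final conversion of tensor 85 into the per-pixel kernels 86: the K^2 channels of each pixel are reshaped into a K x K kernel and each kernel is normalized to sum to unity (the tensor sizes are illustrative assumptions).

```python
import torch

K, H, W = 21, 64, 64                                       # assumed sizes
kernel_map_85 = torch.rand(1, K * K, H, W)                 # output of S83 (tensor 85)

kernels_86 = kernel_map_85.permute(0, 2, 3, 1)             # (1, H, W, K*K)
kernels_86 = kernels_86.reshape(1, H, W, K, K)             # one K x K kernel per pixel
kernels_86 = kernels_86 / kernels_86.sum(dim=(-2, -1), keepdim=True)   # sum to unity
print(kernels_86.shape)                                    # torch.Size([1, 64, 64, 21, 21])
```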

[00232] In the kernel map estimation subnetwork, the number of static pointwise convolution operations such as S80 and S81 can be 2 (other numbers such as 1, 3, 4, 5, or the like can also be used); and the number of dynamic pointwise convolution operations such as S82 and S83 can be 5 (other numbers such as 3, 4, 6, 7, 8, or the like can also be used).

[00233] The weights of the dynamic pointwise convolution operations are estimated by the pointwise convolution weights estimation subnetwork, which takes as the input the global component tensor 22 and outputs the dynamic pointwise convolution weights 61 and 62. The global component tensor 22 is first reshaped from S_g x S_g x C_g to 1 x 1 x C_g·S_g^2 to form tensor 60. Then, for every dynamic pointwise convolution, tensor 60 is processed by at least two fully-connected operations to form the corresponding dynamic pointwise convolution weights. For operation S82, C_e x C_e-sized dynamic pointwise convolution weights are needed, and for operation S83, K^2 x C_e-sized dynamic pointwise convolution weights are needed, where C_e is the channel number of tensors 82 and 84.
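
The following sketch mirrors this subnetwork under assumed sizes: tensor 60 is obtained by flattening the global component tensor, and a separate pair of fully-connected layers (the hidden width of 256 is an assumption) produces the weight tensor for each dynamic pointwise convolution.

```python
import torch
import torch.nn as nn

C_g, S_g, C_e, K = 64, 2, 8, 21              # assumed sizes
g22 = torch.randn(1, C_g, S_g, S_g)          # global component tensor 22
tensor_60 = g22.reshape(1, C_g * S_g * S_g)  # reshape to 1 x 1 x C_g·S_g^2

def weight_head(out_features):
    # At least two fully-connected operations per dynamic pointwise convolution.
    return nn.Sequential(nn.Linear(C_g * S_g * S_g, 256), nn.ReLU(),
                         nn.Linear(256, out_features))

weights_61 = weight_head(C_e * C_e)(tensor_60).view(C_e, C_e, 1, 1)      # for S82
weights_62 = weight_head(K * K * C_e)(tensor_60).view(K * K, C_e, 1, 1)  # for S83
print(weights_61.shape, weights_62.shape)
```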

[00234] Then the per-pixel kernels 86 are convolved with the high-quality image 88 to produce the degraded image 87. After producing the enhanced image 19 and the degraded image 87, the first loss function which describes the difference between the enhanced image 19 and the high-quality image 88 is calculated, and the second loss function which describes the difference between the degraded image 87 and the low-quality image 10 is calculated. The total loss function which is the sum of the first loss function and the second loss function is minimized by updating the weights involved in the neural network.
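
A sketch of the spatially variant convolution step, assuming a single-channel high-quality image and the kernel shapes above: each pixel's K x K neighborhood is weighted by that pixel's kernel and summed to produce the degraded image.

```python
import torch
import torch.nn.functional as F

K, H, W = 21, 64, 64                                       # assumed sizes
high_quality_88 = torch.rand(1, 1, H, W)                   # single-channel for simplicity
kernels_86 = torch.rand(1, H, W, K, K)
kernels_86 = kernels_86 / kernels_86.sum(dim=(-2, -1), keepdim=True)

# Gather the K x K neighborhood of every pixel, weight it by that pixel's kernel, and sum.
patches = F.unfold(high_quality_88, kernel_size=K, padding=K // 2)      # (1, K*K, H*W)
patches = patches.transpose(1, 2).reshape(1, H, W, K, K)
degraded_87 = (patches * kernels_86).sum(dim=(-2, -1)).unsqueeze(1)     # (1, 1, H, W)
print(degraded_87.shape)
```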

[00235] Once the training is over, the enhancement network can be extracted to implement the image enhancement operation. FIG. 7 is the network of the inference process, which may be used to implement the image processing method 500. The enhancement network provided in FIG. 7 corresponds to the neural network of the training process in FIG. 6.

[00236] FIG. 8 is another example of a neural network, which may be the neural network used for training in method 400.

[00237] As shown in FIG. 8, the neural network depicted in FIG. 8 differs from that in FIG. 6 in the absence of the global component tensor injection in the image enhancement network. Accordingly, the global component tensor estimation branch is excluded from the image enhancement network and moved to the image degradation network. In all other aspects, it is the same as that depicted in FIG. 6.

[00238] FIG. 9 shows an image enhancement network which is used at the inference stage corresponding to FIG. 8. Since it lacks the global component tensor estimation and injection, it is identical to the conventional backbone network widely used for SISR or image deblurring tasks. However, because the training stage includes the image degradation network, the image enhancement network is trained to produce enhanced images of better quality as compared to the conventional methods which train this network without the help of the image degradation network.

[00239] FIG. 10 is another example of a neural network, which may be the neural network used for training in method 400.

[00240] As shown in FIG. 10, the neural network depicted in FIG. 10 differs from that in FIG. 6 in that it uses a greatly simplified kernel embedding map estimation subnetwork, which takes as the input only the last tensor from the decoder of the image enhancement network and processes it using two convolutions to produce the kernel embedding map.

[00241] Another difference is that the resulting kernel embedding map is injected back into the decoder of the image enhancement network via concatenation.

[00242] FIG. 11 shows an image enhancement network that is used at the inference stage corresponding to FIG. 10. In this network, both the global and the local components of the degradation information are injected into the decoder.

[00243] It is to be noted that the neural network training method provided in this application can be applied to other tasks, such as image dehazing, image contrast/brightness/color enhancement, and image colorization.

[00244] For the image dehazing application, the image degradation network should be replaced by some kind of image hazing network, which produces a plausible hazy image from a clean one. This requires designing an image hazing network which takes as the input the local component data and whose parameters (e.g., convolution weights) are estimated from the global component data. The training image pairs should be replaced by pairs of hazy images and haze-free images. Also, the global component tensor is determined by the encoder of the enhancement network and the local component tensor is determined by the decoder of the enhancement network.

[00245] The enhancement network depicted in FIG. 11 provides better visual quality due to the injection of both local and global components. The enhancement network of FIG. 9 incurs a smaller computation penalty because no global component tensor or local component tensor is injected into the first decoder of the enhancement network. The enhancement network of FIG. 7 incurs a smaller computation penalty than the enhancement network in FIG. 11 and provides better visual quality than the enhancement network in FIG. 9, due to the injection of the global component tensor into the first decoder of the enhancement network.

[00246] FIG. 12 is a schematic block diagram of a neural network training apparatus 1200 according to an embodiment of this application. As shown in FIG. 12, the neural network training apparatus 1200 includes: an obtaining module 1210, configured to obtain multiple training image pairs, where each pair of the multiple training image pairs includes a high-quality image and a low-quality image, where the high-quality image and the low-quality image are images with different qualities depicting the same scenery; input the low-quality image into an enhancement network to generate an enhanced image; a training module 1220, configured to: input the high-quality image and at least two intermediate tensors into a degradation network to generate a degraded image, where the at least two intermediate tensors are determined based on the enhancement network; determine a first loss function based on the high-quality image and the enhanced image; determine a second loss function based on the low-quality image and the degraded image; update the neural network based on the first loss function and the second loss function, where the neural network includes the enhancement network and the degradation network.

[00247] In one optional implementation, the enhancement network includes a first encoder and a first decoder; the at least two intermediate tensors include a first intermediate tensor and a second intermediate tensor, where the first intermediate tensor is determined based on a bottleneck tensor in the first encoder and the second intermediate tensor is determined based on at least one of network layers in the first decoder.

[00248] In one optional implementation, the degradation network includes a first subnetwork, the training module 1220 is specifically configured to: input the first intermediate tensor and the second intermediate tensor into the first subnetwork to generate a per-pixel kernel map, where convolution weights of multiple network layers of the first subnetwork are determined based on the first intermediate tensor; input the high-quality image into the degradation network; convolve the high-quality image with per-pixel kernels to generate the degraded image, where the per-pixel kernels are determined from the per-pixel kernel map.

[00249] In one optional implementation, a height x width size of the first intermediate tensor is 1 x 1 or 2 x 2 or 3 x 3; or a height x width size of the first intermediate tensor is the same as that of the bottleneck tensor, while a channel number of the first intermediate tensor is 1/8 or 1/4 or 1/2 of a channel number of the bottleneck tensor.

[00250] In one optional implementation, the training module 1220 is further configured to inject a processed result of the first intermediate tensor into at least one network layer in the first decoder, where the processed result of the first intermediate tensor has a spatial size of 1 x 1.

[00251] In one optional implementation, the first subnetwork includes a pointwise convolution weights estimation subnetwork and a kernel map estimation subnetwork, the training module 1220 is specifically configured to: input the first intermediate tensor into the pointwise convolution weights estimation subnetwork to generate multiple dynamic pointwise convolution weights; input the second intermediate tensor into the first subnetwork and determine a local component tensor based on the second intermediate tensor; input the local component tensor and the multiple dynamic pointwise convolution weights into the kernel map estimation subnetwork to generate the per-pixel kernel map, where convolution weights of multiple network layers of the kernel map estimation subnetwork are the multiple dynamic pointwise convolution weights.

[00252] In one optional implementation, the first subnetwork further includes a kernel embedding map estimation subnetwork, the training module 1220 is specifically configured to input the second intermediate tensor into the kernel embedding map estimation subnetwork to generate the local component tensor.

[00253] In one optional implementation, the enhancement network further includes a kernel embedding map estimation subnetwork, the local component tensor is the second intermediate tensor, the training module 1220 is further configured to: input at least one of network layer tensors in the first decoder into the kernel embedding map estimation subnetwork to generate the second intermediate tensor; concatenate the second intermediate tensor with one of network layer tensors in the first decoder.

[00254] In one optional implementation, a height x width size of the local component tensor is the same as that of the high-quality image and a channel number of the local component tensor is 4 or 8 or 16 or 24 or 32.

[00255] In one optional implementation, skip connections are provided between the first encoder and the first decoder.

[00256] In one optional implementation, the enhancement network follows a U-net design.

[00257] FIG. 13 is a schematic block diagram of an image processing apparatus 1300 according to an embodiment of this application. As shown in FIG. 13, the image processing apparatus 1300 includes: an obtaining module 1310, configured to obtain a to-be-processed image; a processing module 1320, configured to process the to-be-processed image with an enhancement network, where the enhancement network is obtained by training a neural network comprising the enhancement network and a degradation network based on multiple training image pairs, where each pair of the multiple training image pairs includes a high-quality image and a low-quality image. The neural network is trained based on a first loss function and a second loss function, where the first loss function is determined based on the high-quality image and an enhanced image, the second loss function is determined based on the low-quality image and a degraded image. The enhanced image is generated by the enhancement network with an input of the low-quality image, the degraded image is generated by the degradation network with inputs of the high-quality image and at least two intermediate tensors, where the at least two intermediate tensors are determined based on the enhancement network.

[00258] In one optional implementation, the enhancement network includes a first encoder and a first decoder; the at least two intermediate tensors include a first intermediate tensor and a second intermediate tensor, where the first intermediate tensor is determined based on a bottleneck tensor in the first encoder and the second intermediate tensor is determined based on at least one of network layers in the first decoder.

[00259] In one optional implementation, the degradation network includes a first subnetwork, the degraded image is obtained by convolving the high-quality image with per-pixel kernels, the per-pixel kernels are determined from a per-pixel kernel map, which is generated by the first subnetwork with inputs of the first intermediate tensor and the second intermediate tensor, where convolution weights of multiple network layers of the first subnetwork are determined based on the first intermediate tensor.

[00260] In one optional implementation, a height x width size of the first intermediate tensor is 1 x 1 or 2 x 2 or 3 x 3; or a height x width size of the first intermediate tensor is the same as that of the bottleneck tensor, while a channel number of the first intermediate tensor is 1/8 or 1/4 or 1/2 of a channel number of the bottleneck tensor.

[00261] In one optional implementation, a processed result of the first intermediate tensor is further injected into at least one network layer in the first decoder, where the processed result of the first intermediate tensor has a spatial size of 1 x 1.

[00262] In one optional implementation, the first subnetwork includes a pointwise convolution weights estimation subnetwork and a kernel map estimation subnetwork, the per-pixel kernel map is generated by the kernel map estimation subnetwork with inputs of a local component tensor and multiple dynamic pointwise convolution weights, where convolution weights of multiple network layers of the kernel map estimation subnetwork are the multiple dynamic pointwise convolution weights; the multiple dynamic pointwise convolution weights are generated by the pointwise convolution weights estimation subnetwork with an input of the first intermediate tensor, the local component tensor is determined based on the second intermediate tensor.

[00263] In one optional implementation, the first subnetwork further includes a kernel embedding map estimation subnetwork, the local component tensor is determined by the kernel embedding map estimation subnetwork with an input of the second intermediate tensor.

[00264] In one optional implementation, the enhancement network further includes a kernel embedding map estimation subnetwork, the local component tensor is the second intermediate tensor, the second intermediate tensor is generated by the kernel embedding map estimation subnetwork with an input of at least one of network layer tensors in the first decoder; the second intermediate tensor is concatenated with one of network layer tensors in the first decoder.

[00265] In one optional implementation, a height x width size of the local component tensor is the same as that of the high-quality image and a channel number of the local component tensor is 4 or 8 or 16 or 24 or 32.

[00266] In one optional implementation, skip connections are provided between the first encoder and the first decoder.

[00267] In one optional implementation, the enhancement network follows a U-net design.

[00268] FIG. 14 is a schematic diagram of a hardware structure of a neural network training apparatus according to an embodiment of this application. A neural network training apparatus 1400 shown in FIG. 14 includes a memory 1410, a processor 1420, a communications interface 1430, and a bus 1440. The memory 1410, the processor 1420, and the communications interface 1430 implement communication connection to each other through the bus 1440.

[00269] The memory 1410 may store a program. When the program stored in the memory 1410 is executed by the processor 1420, the processor 1420 is configured to perform steps of the neural network training method 400 in the embodiments of this application.

[00270] The processor 1420 may use a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute a related program, to perform the neural network training method 400 in the embodiments of this application.

[00271] The processor 1420 may alternatively be an integrated circuit chip and has a signal processing capability. In an implementation process, steps of the neural network training method in the embodiments of this application may be accomplished by using an integrated logic circuit of hardware in the processor 1420 or instructions in a form of software.

[00272] It should be understood that the neural network training apparatus 1400 shown in FIG. 14 is used to train a neural network, and part of the neural network obtained through training (the enhancement network) may be used to perform the image processing method 500 in the embodiments of this application. Specifically, the neural network in the method 400 shown in FIG. 4 can be obtained by training the neural network by using the apparatus 1400.

[00273] Specifically, the apparatus shown in FIG. 14 may obtain training image pairs and a to-be-trained neural network from the outside through the communications interface 1430, and then the processor trains the to-be-trained neural network based on the training image pairs.

[00274] FIG. 15 is a schematic diagram of a hardware structure of an image processing apparatus 1500 according to an embodiment of this application. An image processing apparatus 1500 shown in FIG. 15 includes a memory 1510, a processor 1520, a communications interface 1530, and a bus 1540. The memory 1510, the processor 1520, and the communications interface 1530 implement communication connection to each other through the bus 1540.

[00275] The memory 1510 may store a program. When the program stored in the memory 1510 is executed by the processor 1520, the processor 1520 is configured to perform steps of the image processing method 500 in the embodiments of this application.

[00276] The processor 1520 may use a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute a related program, to perform the image processing method 500 in the embodiments of this application.

[00277] The processor 1520 may alternatively be an integrated circuit chip and has a signal processing capability. In an implementation process, steps of the image processing method 500 in the embodiments of this application may be accomplished by using an integrated logic circuit of hardware in the processor 1520 or instructions in a form of software.

[00278] Specifically, the apparatus shown in FIG. 15 may obtain a to-be-processed image from the outside through the communications interface 1530, and then the processor performs the image processing method based on the to-be-processed image to generate a processed image.

[00279] It should be noted that although the apparatuses 1400 and 1500 show only the memory, the processor, and the communications interface, in a specific implementation process, a person skilled in the art should understand that the apparatuses 1400 and 1500 may further include another component necessary for normal operation. In addition, based on a specific requirement, a person skilled in the art should understand that the apparatuses 1400 and 1500 may further include hardware components for implementing other additional functions. In addition, a person skilled in the art should understand that the apparatuses 1400 and 1500 may include only the components required for implementing the embodiments of this application, and there is no need to include all the components shown in FIG. 14 and FIG. 15.

[00280] A person of ordinary skill in the art may be aware that, with reference to the units and algorithm steps described in the examples of the embodiments disclosed in this specification, the embodiments of this application may be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on particular applications and design constraint conditions of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of this application.

[00281] It may be clearly understood by a person skilled in the art that, for the purpose of convenient and brief description, for a detailed working process of the foregoing system, apparatus, and unit, reference is made to a corresponding process in the foregoing method embodiments, and details are not described herein again.

[00282] In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiment is merely an example. For example, division into units is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.

[00283] The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. A part or all of the units may be selected based on actual requirements to achieve the objectives of the solutions of the embodiments.

[00284] In addition, functional units in the embodiments of this application may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.

[00285] When the functions are implemented in the form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or some of the technical solutions may be implemented in a form of a software product. The computer software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or a part of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes: any medium that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.

[00286] The foregoing descriptions are merely specific implementations of this application, but are not intended to limit the protection scope of this application. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application.