

Title:
COMPUTER-IMPLEMENTED METHOD FOR SEMANTIC SEGMENTATION OF AN IMAGE
Document Type and Number:
WIPO Patent Application WO/2024/046851
Kind Code:
A1
Abstract:
A computer-implemented method for semantic segmentation of an image (10), the image (10) including a plurality of pixels (12), a position (14) of each pixel (12) being represented by two-dimensional pixel coordinates (16), the method comprising: a) Feeding image data (18) of the plurality of pixels (12) to an initial convolutional neural network (26) in order to create feature map data (28) of the plurality of pixels (12); b) Generating a hypernetwork (36) having at least one hyperlayer (44), the at least one hyperlayer (44) including a neural network (46), the generating including: c) Feeding the feature map data (28) of a selected pixel (12) as a parameter (40) and a set of periodic basis functions (50) as an input to the neural network (46), the set of periodic basis functions (50) representing the pixel coordinates (16) of the selected pixel (12); and d) Generating a segmentation mask (56) based on an output (42) of the neural network (46).

Inventors:
HOY MICHAEL COLIN (SG)
Application Number:
PCT/EP2023/073195
Publication Date:
March 07, 2024
Filing Date:
August 24, 2023
Assignee:
CONTINENTAL AUTOMOTIVE TECH GMBH (DE)
International Classes:
G06V10/26; G06V10/82
Other References:
NIRKIN, YUVAL ET AL: "HyperSeg: Patch-wise Hypernetwork for Real-time Semantic Segmentation", 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 20 June 2021 (2021-06-20), pages 4060 - 4069, XP034007782, DOI: 10.1109/CVPR46437.2021.00405
ANONYMOUS: "1D, 2D, and 3D Sinusoidal Positional Encoding (Pytorch and Tensorflow)", 25 July 2022 (2022-07-25), pages 1 - 3, XP093102221, Retrieved from the Internet [retrieved on 20231116]
TAN, MINGXING ET AL: "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks", 11 September 2020 (2020-09-11), pages 1 - 11, XP055873656, Retrieved from the Internet [retrieved on 20211216]
LIU, ZICHEN; LIEW, JUN HAO; CHEN, XIANGYU; FENG, JIASHI: "DANCE: A Deep Attentive Contour Model for Efficient Instance Segmentation", PROCEEDINGS OF THE IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION, 2021, pages 345 - 354, XP033926415, DOI: 10.1109/WACV48630.2021.00039
LIANG, JUSTIN; HOMAYOUNFAR, NAMDAR; MA, WEI-CHIU; XIONG, YUWEN; HU, RUI; URTASUN, RAQUEL: "Polytransform: Deep polygon transformer for instance segmentation", PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2020, pages 9131 - 9140
PENG, SIDA; JIANG, WEN; PI, HUAIJIN; LI, XIULI; BAO, HUJUN; ZHOU, XIAOWEI: "Deep snake for real-time instance segmentation", PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2020, pages 8533 - 8542
YUAN, YUHUI; XIE, JINGYI; CHEN, XILIN; WANG, JINGDONG: "European Conference on Computer Vision", 2020, SPRINGER, article "Segfix: Model-agnostic boundary refinement for segmentation", pages: 489 - 506
HOMAYOUNFAR, NAMDAR; XIONG, YUWEN; LIANG, JUSTIN; MA, WEI-CHIU; URTASUN, RAQUEL: "European Conference on Computer Vision", 2020, SPRINGER, article "Levelset r-cnn: A deep variational method for instance segmentation", pages: 555 - 571
TAKIKAWA, TOWAKI; ACUNA, DAVID; JAMPANI, VARUN; FIDLER, SANJA: "Gated-scnn: Gated shape cnns for semantic segmentation", PROCEEDINGS OF THE IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, 2019, pages 5229 - 5238
YOO, SEUNGWOO; LEE, HEE SEOK; MYEONG, HEESOO; YUN, SUNGRACK; PARK, HYOUNGWOO; CHO, JANGHOON; KIM, DUCK HOON: "End-to-end lane marker detection via row-wise classification", PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, 2020, pages 1006 - 1007
VAN GANSBEKE, WOUTER; DE BRABANDERE, BERT; NEVEN, DAVY; PROESMANS, MARC; VAN GOOL, LUC: "End-to-end lane detection through differentiable least-squares fitting", PROCEEDINGS OF THE IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, 2019, pages 0 - 0
LIU, RUIJIN; YUAN, ZEJIAN; LIU, TIE; XIONG, ZHILIANG: "End-to-end lane shape prediction with transformers", PROCEEDINGS OF THE IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION, 2021, pages 3694 - 3702
WANG, ZE; REN, WEIQIANG; QIU, QIANG: "Lanenet: Real-time lane detection networks for autonomous driving", 2018
CHEN, ZHIQIN; ZHANG, HAO: "Learning implicit fields for generative shape modeling", PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2019, pages 5939 - 5948
MICHALKIEWICZ, MATEUSZ; PONTES, JHONY K.; JACK, DOMINIC; BAKTASHMOTLAGH, MAHSA; ERIKSSON, ANDERS: "Implicit surface representations as layers in neural networks", PROCEEDINGS OF THE IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, 2019, pages 4743 - 4752
BENBARKA, NURI; HOFER, TIMON; ZELL, ANDREAS: "Seeing Implicit Neural Representations as Fourier Series", 2021
SITZMANN, VINCENT; MARTEL, JULIEN; BERGMAN, ALEXANDER; LINDELL, DAVID; WETZSTEIN, GORDON: "Implicit neural representations with periodic activation functions", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, vol. 33, 2020
JIANG, CHIYU; SUD, AVNEESH; MAKADIA, AMEESH; HUANG, JINGWEI; NIESSNER, MATTHIAS; FUNKHOUSER, THOMAS: "Local implicit grid representations for 3d scenes", PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, pages 6001
IBING, MORITZ; LIM, ISAAK; KOBBELT, LEIF: "3D Shape Generation with Grid-based Implicit Functions", PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2021, pages 13559 - 13568
DEVRIES, TERRANCE; BAUTISTA, MIGUEL ANGEL; SRIVASTAVA, NITISH; TAYLOR, GRAHAM W.; SUSSKIND, JOSHUA M.: "Unconstrained Scene Generation with Locally Conditioned Radiance Fields", 2021
COSTAIN, THEO W.; PRISACARIU, VICTOR ADRIAN: "Towards Generalising Neural Implicit Representations", 2021
KOHLI, AMIT PAL SINGH; SITZMANN, VINCENT; WETZSTEIN, GORDON: "2020 International Conference on 3D Vision (3DV)", 2020, IEEE, article "Semantic implicit neural scene representations with semi-supervised training", pages: 423 - 433
LIN, TSUNG-YI; GOYAL, PRIYA; GIRSHICK, ROSS; HE, KAIMING; DOLLAR, PIOTR: "Focal loss for dense object detection", PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION, 2017, pages 2980 - 2988
Attorney, Agent or Firm:
CONTINENTAL CORPORATION (DE)
Claims:
CLAIMS

1. A computer-implemented method for semantic segmentation of an image (10), the image (10) including a plurality of pixels (12), a position (14) of each pixel (12) being represented by two-dimensional pixel coordinates (16), the method comprising: a) Feeding image data (18) of the plurality of pixels (12) to an initial convolutional neural network (26) in order to create feature map data (28) of the plurality of pixels (12); b) Generating a hypernetwork (36) having at least one hyperlayer (44), the at least one hyperlayer (44) including a neural network (46), the generating including: c) Feeding the feature map data (28) of a selected pixel (12) as a parameter (40) and a set of periodic basis functions (50) as an input to the neural network (46), the set of periodic basis functions (50) representing the pixel coordinates (16) of the selected pixel (12); and d) Generating a segmentation mask (56) based on an output (42) of the neural network (46).

2. The method according to claim 1, characterized in that step c) comprises: c1) Feeding the feature map data (28) and the set of periodic basis functions (50) to the neural network (46) for each of the plurality of pixels (12) separately.

3. The method according to claim 1 or 2, characterized in that the set of periodic basis functions (50) include sine and cosine functions, and preferably a base harmonic of the set of periodic basis functions (50) matches a resolution of the feature map data (28).

4. The method according to any of the preceding claims, characterized in that step c) comprises: c2) Feeding the output (42) of the neural network (46) to a convolutional neural network (52), the at least one hyperlayer (44) including the convolutional neural network (52), the convolutional neural network (52) preferably being a spatial convolutional neural network.

5. The method according to any of the preceding claims, characterized in that step b) comprises: b1) Generating the hypernetwork (36) with a plurality of connected hyperlayers (44), each hyperlayer (44) including a neural network (46), the output (42) of the neural network (46) preferably being fed to a convolutional neural network (52).

6. The method according to any of the preceding claims, characterized in that the initial convolutional neural network (26) includes a residual neural network layer (34) and/or the initial convolutional neural network (26) includes a two-dimensional convolutional neural network layer (32).

7. Method of any of the preceding claims, characterized in that the neural network (46) of the at least one hyperlayer (44) is a multilayer perceptron (48).

8. The method according to any of the preceding claims, the method further comprising: e) Before generating the hypernetwork (36) or feeding the feature map data (28) to the neural network (46), performing an up-sampling of the feature map data (28) in order to generate up-sampled feature map data (30), the up-sampling using at least one kernel (60), preferably the at least one kernel (60) is periodic.

9. The method according to any of the preceding claims, the method further comprising: f) Training the initial convolutional network (26) and the hypernetwork (36) simultaneously.

10. The method according to claims 8 and 9, characterized in that step f) comprises: f1) During training, adding a random phase offset (62) to the at least one kernel (60).

11. The method according to claim 8, characterized in that step e) comprises: e1) Using separate kernels (60a, 60b) for each of the dimensions; and e2) Up-sampling the feature map data (28) with a sum of the separate kernels (60a, 60b).

12. The method according to any of the preceding claims, the method further comprising one or both of the following: g) Pooling the feature map data (28) and performing step c) with the pooled feature map data (68); and h) Generating a plurality of hypernetworks (36, 64, 70, 76) with one or more times pooled feature map data (68, 74) and hierarchically structuring the generated hypernetworks (36, 64, 70, 76).

13. A data processing device (80) comprising means for carrying out the method of any of the preceding claims.

14. A computer program (82) comprising instructions which, when the program (82) is executed by a computer, cause the computer to carry out the method of any of the claims 1 to 12.

15. A computer-readable data carrier (84) having stored thereon the computer program (82) of claim 14.

Description:
DESCRIPTION

Computer-implemented method for semantic segmentation of an image

TECHNICAL FIELD

The invention relates to a computer-implemented method for semantic segmentation of an image. The invention further relates to a data processing device, a computer program and a computer-readable data carrier.

BACKGROUND

Semantic segmentation is a widely used method to assign labels to every pixel in an input image. Semantic segmentation networks generally comprise an encoder and a decoder. A common method to implement a semantic segmentation decoder is a neural network structure known as "convolutional neural network" (CNN). For CNNs, it may be hard to learn "thin" or fine structures, such as poles, kerbs, etc. Furthermore, it may be hard for CNNs to learn spatial smoothness relationships across medium-large distance scales. This may be especially pronounced when semantic segmentation is applied to applications such as bird's-eye view (BEV) map prediction, where boundaries are generally smooth, but the geometric structure of different parts of the map must be represented accurately and consistently in the output.

One approach is to use iterative boundary refinement with CNN architectures so that errors/inconsistencies can be gradually corrected. This approach may either use key points or local deformations of the pixels and may process only one class at a time. Improved output layers may represent geometry explicitly. Still, such approaches are very specific to certain applications (e.g., road marker segmentation) and may not be generalized to a broader range of applications.

The present invention relates to the application of conditional implicit shape models (conditional ISM) to the segmentation problem. In conditional implicit shape models, a single neural network can be trained to generate an output conditioned on some input data. This means that the ISM is not trained separately for every data example. A synonym for conditional implicit shape models is “online implicit shape models”.

One way of implementing this type of ISM is to use input augmentation. In input augmentation, the conditioning parameters are appended to the positional encoding at the input of the ISM. Another way of implementing conditional ISMs is based on hypernetworks. In hypernetworks, the parameters of the "rendering network" are learned. This allows the ISM to be smaller.

Iterative refinement methods are subject-matter of the following documents:

[1] Liu, Zichen, Jun Hao Liew, Xiangyu Chen, and Jiashi Feng. "DANCE: A Deep Attentive Contour Model for Efficient Instance Segmentation." In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 345-354. 2021.

[2] Liang, Justin, Namdar Homayounfar, Wei-Chiu Ma, Yuwen Xiong, Rui Hu, and Raquel Urtasun. "Polytransform: Deep polygon transformer for instance segmentation." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9131-9140. 2020.

[3] Peng, Sida, Wen Jiang, Huaijin Pi, Xiuli Li, Hujun Bao, and Xiaowei Zhou. "Deep snake for real-time instance segmentation." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8533-8542. 2020.

[4] Yuan, Yuhui, Jingyi Xie, Xilin Chen, and Jingdong Wang. "Segfix: Model-agnostic boundary refinement for segmentation." In European Conference on Computer Vision, pp. 489-506. Springer, Cham, 2020.

[5] Homayounfar, Namdar, Yuwen Xiong, Justin Liang, Wei-Chiu Ma, and Raquel Urtasun. "Levelset r-cnn: A deep variational method for instance segmentation." In European Conference on Computer Vision, pp. 555-571. Springer, Cham, 2020.

[6] Takikawa, Towaki, David Acuna, Varun Jampani, and Sanja Fidler. "Gated-scnn: Gated shape cnns for semantic segmentation." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5229-5238. 2019.

Lane marker detection methods are subject-matter of the following documents:

[7] Yoo, Seungwoo, Hee Seok Lee, Heesoo Myeong, Sungrack Yun, Hyoungwoo Park, Janghoon Cho, and Duck Hoon Kim. "End-to-end lane marker detection via row-wise classification." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 1006-1007. 2020.

[8] Van Gansbeke, Wouter, Bert De Brabandere, Davy Neven, Marc Proesmans, and Luc Van Gool. "End-to-end lane detection through differentiable least-squares fitting." In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0-0. 2019.

[9] Liu, Ruijin, Zejian Yuan, Tie Liu, and Zhiliang Xiong. "End-to-end lane shape prediction with transformers." In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3694-3702. 2021.

[10] Wang, Ze, Weiqiang Ren, and Qiang Qiu. "Lanenet: Real-time lane detection networks for autonomous driving." arXiv preprint arXiv:1807.01726 (2018).

Implicit shape models (ISMs) are subject-matter of the following documents:

[11] Chen, Zhiqin, and Hao Zhang. "Learning implicit fields for generative shape modeling." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5939-5948. 2019.

[12] Michalkiewicz, Mateusz, Jhony K. Pontes, Dominic Jack, Mahsa Baktashmotlagh, and Anders Eriksson. "Implicit surface representations as layers in neural networks." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4743-4752. 2019.

Periodic encoding in implicit shape models is subject-matter of the following documents:

[13] Benbarka, Nuri, Timon Hofer, and Andreas Zell. "Seeing Implicit Neural Representations as Fourier Series." arXiv preprint arXiv:2109.00249 (2021).

[14] Sitzmann, Vincent, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. "Implicit neural representations with periodic activation functions." Advances in Neural Information Processing Systems 33 (2020).

Grid-structured implicit shape models are subject-matter of the following documents:

[15] Jiang, Chiyu, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Nießner, and Thomas Funkhouser. "Local implicit grid representations for 3d scenes." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6001

[16] Ibing, Moritz, Isaak Lim, and Leif Kobbelt. "3D Shape Generation with Grid-based Implicit Functions." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13559-13568. 2021.

[17] DeVries, Terrance, Miguel Angel Bautista, Nitish Srivastava, Graham W. Taylor, and Joshua M. Susskind. "Unconstrained Scene Generation with Locally Conditioned Radiance Fields." arXiv preprint arXiv:2104.00670 (2021).

Implicit shape models for segmentation are subject-matter of the following documents:

[18] Costain, Theo W., and Victor Adrian Prisacariu. "Towards Generalising Neural Implicit Representations." arXiv preprint arXiv:2101.12690 (2021).

[19] Kohli, Amit Pal Singh, Vincent Sitzmann, and Gordon Wetzstein. "Semantic implicit neural scene representations with semi-supervised training." In 2020 International Conference on 3D Vision (3DV), pp. 423-433. IEEE, 2020.

Hierarchical structures for implicit shape models are subject-matter of the following document:

[20] Lin, Tsung-Yi, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. "Focal loss for dense object detection." In Proceedings of the IEEE international conference on computer vision, pp. 2980-2988. 2017.

SUMMARY OF THE INVENTION

The object of the invention is to provide an improved method for semantic segmentation of an image.

To achieve this object, the invention provides a computer-implemented method for semantic segmentation of an image according to claim 1. A data processing device, a computer program and a computer-readable data carrier are subject-matter of the parallel claims.

Advantageous embodiments of the invention are subject-matter of the dependent claims.

Hereinafter, the term "implicit shape model" is used to designate the concept of describing a shape with a function, i.e., a neural network, that maps image coordinates to occupancy.

The term "hypernetwork" is used to designate the concept of using the output of one neural network to dynamically generate the parameters of another (usually smaller) network. The second neural network can be used to define an implicit shape model. The term "hierarchical generation" is used to designate the concept of generating a low-resolution image and then using it as a guide for creating a higher-resolution image.
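As an illustration of the hypernetwork concept, the following sketch (Python/NumPy) reinterprets the output of one network as the weights of a second, smaller "rendering" network. All dimensions, the single-linear-layer generator and the tanh activation are illustrative assumptions, not details taken from the application.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a 64-channel feature vector conditions a small
# rendering MLP that maps an 8-dimensional positional encoding to 3 logits.
FEAT, ENC, HIDDEN, CLASSES = 64, 8, 16, 3
n_params = ENC * HIDDEN + HIDDEN + HIDDEN * CLASSES + CLASSES

# The hypernetwork itself: here a single linear layer whose output vector
# is sliced up and reshaped into the parameters of the rendering MLP.
W_gen = rng.standard_normal((n_params, FEAT)) * 0.01

def render_logits(feature_vec, encoding):
    """Generate the rendering MLP's parameters from the feature vector,
    then evaluate the MLP on the positional encoding."""
    p = W_gen @ feature_vec
    W1 = p[:ENC * HIDDEN].reshape(HIDDEN, ENC)
    i = ENC * HIDDEN
    b1 = p[i:i + HIDDEN]; i += HIDDEN
    W2 = p[i:i + HIDDEN * CLASSES].reshape(CLASSES, HIDDEN); i += HIDDEN * CLASSES
    b2 = p[i:]
    h = np.tanh(W1 @ encoding + b1)
    return W2 @ h + b2

logits = render_logits(rng.standard_normal(FEAT), rng.standard_normal(ENC))
```

Because the generated network is small, evaluating it per pixel stays cheap, which is the motivation stated above for the hypernetwork variant.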

The term "feature map" is used to designate the generalization of the concept of an image to higher dimensions. An image may be represented as a multidimensional array of size H x W x 3, where the three channels represent the red, green, and blue color channels. In a feature map, there may be thousands of channels.

The term "structured up-sampling" is used to designate up-sampling whose kernel is composed of a sum of other kernels. Normally, in a neural network, up-sampling may be performed by suitably selected convolutional layers (also referred to as deconvolutions or transposed convolutions). In contrast, structured up-sampling enforces a pattern in both the vertical and the horizontal directions.

A basis expansion refers to the concept of converting coordinates into a higher-dimensional representation. For example, a sine or cosine function may be computed at a different frequency for each element of the higher-dimensional representation.
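A minimal sketch of such a basis expansion (Python/NumPy); the number of frequencies and the normalization of the coordinates to [0, 1] are illustrative choices, not requirements of the method:

```python
import numpy as np

def basis_expansion(x, y, n_freqs=4):
    """Expand 2-D pixel coordinates in [0, 1] into sine/cosine features
    at doubling frequencies (one sin/cos pair per frequency per axis)."""
    feats = []
    for k in range(n_freqs):
        f = (2.0 ** k) * np.pi
        feats += [np.sin(f * x), np.cos(f * x), np.sin(f * y), np.cos(f * y)]
    return np.array(feats)

enc = basis_expansion(0.25, 0.75)  # 4 frequencies x 2 axes x (sin, cos) = 16 features
```

The lowest frequency here plays the role of the "base harmonic" mentioned later: choosing it to match the feature map resolution keeps the encoding consistent across grid cells.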

In the following, the labels a), b), b1), c) etc. are used as a reference for labelling steps of a method. The labels are not intended to limit the scope of the invention to a specific order or sequence of steps. Furthermore, a plurality of the steps may be performed sequentially or in parallel, if not indicated otherwise.

In one aspect, the invention provides a computer-implemented method for semantic segmentation of an image, the image including a plurality of pixels, a position of each pixel being represented by two-dimensional pixel coordinates, the method comprising: a) Feeding image data of the plurality of pixels to an initial convolutional neural network in order to create feature map data of the plurality of pixels; b) Generating a hypernetwork having at least one hyperlayer, the at least one hyperlayer including a neural network, the generating including: c) Feeding the feature map data of a selected pixel as a parameter and a set of periodic basis functions as an input to the neural network, the set of periodic basis functions representing the pixel coordinates of the selected pixel; and d) Generating a segmentation mask based on an output of the neural network.

Preferably, step c) comprises: c1 ) Feeding the feature map data and the set of periodic basis functions to the neural network for each of the plurality of pixels separately.

Preferably, the set of periodic basis functions include sine and cosine functions, and preferably a base harmonic of the set of periodic basis functions matches a resolution of the feature map data.

Preferably, step d) comprises:

Feeding the output of the neural network to a normalization function, preferably to a softmax function.

Preferably, step c) comprises: c2) Feeding the output of the neural network to a convolutional neural network, the at least one hyperlayer including the convolutional neural network, the convolutional neural network preferably being a spatial convolutional neural network.

Preferably, step b) comprises: b1 ) Generating the hypernetwork with a plurality of connected hyperlayers, each hyperlayer including a neural network, the output of the neural network preferably being fed to a convolutional neural network.

Preferably, the initial convolutional neural network includes a residual neural network layer and/or the initial convolutional neural network includes a two-dimensional convolutional neural network layer.

Preferably, the method further comprises: e) Before generating the hypernetwork or feeding the feature map data to the neural network, performing an up-sampling of the feature map data in order to generate up-sampled feature map data, the up-sampling using at least one kernel.

Preferably, the at least one kernel is periodic.

Preferably, the method further comprises: f) Training the initial convolutional network and the hypernetwork simultaneously.

Preferably, step f) comprises: f1 ) During training, adding a random phase offset to the at least one kernel.

Preferably, step e) comprises: e1 ) Using separate kernels for each of the dimensions; and e2) Up-sampling the feature map data with a sum of the separate periodic kernels.

Preferably, the method further comprises one or both of the following: g) Pooling the feature map data and performing step c) with the pooled feature map data; and h) Generating a plurality of hypernetworks with one or more times pooled feature map data and hierarchically structuring the generated hypernetworks.

In another aspect, the invention provides a data processing device comprising means for carrying out the method of any of the preceding embodiments.

In another aspect, the invention provides a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of the preceding embodiments.

In another aspect, the invention provides a computer-readable data carrier having stored thereon the computer program.

Embodiments of the invention preferably have the following advantages and effects: The invention preferably provides a decoder layer for semantic segmentation neural networks which is better able to handle fine structures with smoothness constraints.

Preferably, the invention provides a method for combining grid-based conditional shape models with periodic activations in a way that retains the benefits of both variants. Preferred embodiments of the invention further propose a hierarchical structuring of grid-based implicit shape models.

Preferably, the invention applies ISMs to segmentation using the following features simultaneously:

- a grid-based structure of the ISM forward model: this allows different models to handle different parts of the image; and

- a periodic encoding on the input of the ISM forward model: this makes it easier for the neural network to learn boundaries across multiple distance scales.

In other words, embodiments of the invention preferably combine a spatial structuring of a grid-based hypernetwork with a periodic encoding of the positional inputs. By ensuring the harmonics of the periodic encoding are congruent, embodiments of the invention may ensure that the implicit shape model is consistent across the entire image, i.e., the issue of artifacts introduced by the grid-based representation may be attenuated.

Embodiments of the invention further may propose a hierarchical generation structure as an additional mechanism that may ensure consistency across multiple distance scales. In embodiments of the invention, hypernetwork layers may be interleaved per pixel with spatial convolutional neural network layers.

Embodiments of the method proposed herein preferably have the following processing steps of the neural network structure:

- a feature map may be obtained by processing an image or other sensor data through a standard neural network architecture;

- a structured up-sampling may be performed on the feature map to obtain a grid of hypernetwork parameters at the desired output resolution;

- the up-sampling may be performed in a way that introduces periodic structure into the outputted hypernetwork parameters;

- for each element of the feature map, a fully connected layer is employed, the layer generating the output for all the pixels corresponding to that element of the feature map;

- constraints may be introduced so that the parameters of the structured up-sampling vary smoothly within each cycle of repetition.

In embodiments of the invention, a basis expansion of image pixel coordinates may be computed as inputs to the hypernetwork. With the hypernetwork parameters for each pixel and the basis expansion for each pixel, a semantic classification may be computed for the pixel.

This may make use of both (hyper-)parameter network blocks and regular convolutional neural network (CNN) blocks, where the convolutions may be performed over the regular image grid. Afterwards, standard neural network operations like the softmax function may be performed in order to generate the final segmentation or output mask.
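The per-pixel decoding plus softmax described above might be sketched as follows (Python/NumPy). Representing each pixel's hypernetwork parameters as a single weight matrix, and all sizes, are illustrative assumptions for the sketch only:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy sizes: a 4 x 4 output grid, a 16-dimensional positional encoding,
# and 3 semantic classes. Here the "hypernetwork parameters" for a pixel
# are simply a (classes x encoding) weight matrix.
H, W, ENC, CLASSES = 4, 4, 16, 3
rng = np.random.default_rng(1)
pixel_params = rng.standard_normal((H, W, CLASSES, ENC))  # from the up-sampled feature map
encoding = rng.standard_normal((H, W, ENC))               # basis expansion per pixel

# Per-pixel classification: apply that pixel's generated parameters to its
# positional encoding, then take the most likely class after the softmax.
mask = np.zeros((H, W), dtype=int)
for i in range(H):
    for j in range(W):
        probs = softmax(pixel_params[i, j] @ encoding[i, j])
        mask[i, j] = int(np.argmax(probs))
```

In a real decoder the per-pixel loop would be vectorized and interleaved with the spatial CNN blocks mentioned above; the sketch only shows the data flow from parameters and encoding to the output mask.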

By combining a conditional implicit rendering approach with a CNN approach, the neural network may delegate the final semantic mask decoding to a specialized layer which provides a more natural encoding that promotes spatial and thin structure consistency.

Preferred embodiments of the invention may have the following further advantages:

- a periodic structuring of the generated hyperparameters such that the implicit shape model can better interface with the periodic basis functions used to encode the image or pixel coordinates;

- instead of dividing the operating region into patches, the hypernetwork parameters are varied smoothly for every rendered pixel. Thus, more attention is paid to the boundaries between individual hypernetworks;

- spatial convolutional neural network layers are interspersed between pixel-wise neural network layers of the hypernetwork so that a grid-structured prediction problem is solved;

- a hierarchical structuring of implicit shape models may be employed, which may generalize to even larger variations in distance scale.

In terms of solving the segmentation problem, the method may also be applied to other segmentation problems where spatial consistency is of high importance, for example in biomedical image analysis or lane structure detection.

Combining a periodic activation with a grid-structured representation may also be useful for neural radiance fields for rendering alternate viewpoints in 3D scenes. This may be especially applicable in cases where accurate spatial correctness of the scene is more important than displaying photorealistic textures on rendered objects.

The computer program employing the described method may run on a mobile robot computer to enable navigation. The computer program may also run on a server maintaining a global grid map of the operating environment of the mobile robot.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention are now explained in more detail with reference to the accompanying drawings of which

Fig. 1 shows a computer-implemented method for semantic segmentation of an image according to an embodiment of the invention;

Fig. 2 shows a computer-implemented method for up-sampling feature map data;

Fig. 3 shows a computer-implemented method for training; and

Fig. 4 shows a computer-implemented method for generating a plurality of hypernetworks and hierarchically structuring the plurality of hypernetworks.

DETAILED DESCRIPTION OF EMBODIMENTS

Fig. 1 shows a computer-implemented method for semantic segmentation of an image 10 according to an embodiment of the invention.

The image 10 (not shown) has a height H and a width W and is provided in the form of a plurality of pixels 12. For example, the image 10 can have a height H of 256 pixels 12 and a width W of 256 pixels 12. Thus, the image 10 may include 256 x 256 pixels. A position 14 of each pixel 12 can be represented in the form of two-dimensional pixel coordinates 16.

Image or input data 18 include information 20 on the plurality of pixels 12. For example, a color 22 of a pixel 12 can be described in the form of red, green, and blue proportions. Thus, the image data 18 can be represented in the form of an array 24 of 256 x 256 x 3 dimensions.

In a first step of the method, the image data 18 are fed to an initial convolutional neural network 26 in order to create feature map data 28 of the plurality of pixels 12.

In the present embodiment, the initial convolutional neural network 26 includes a two-dimensional convolutional neural network layer 32. The feature map data 28 can be represented in the form of an array 24. Depending on the two-dimensional convolutional neural network layer 32, the feature map data 28 can have dimensions different from those of the image data 18. For example, the feature map data 28 can be provided in the form of an array 24 of 128 x 128 x 64 dimensions. The initial convolutional neural network 26 may further include a residual neural network layer 34, which again may change the dimensions of the feature map data 28.

In a second step of the method, an up-sampling is performed on the feature map data 28. In this context, up-sampling means making the dimensions of the feature map data 28 equal to those of the image data 18. The up-sampling generates up-sampled feature map data 30.

In a third step of the method, a hypernetwork 36 is generated. In this context, a hypernetwork 36 means a second neural network whose input 38 or parameters 40 are provided by the output 42 of a first neural network. A hypernetwork 36 is built from at least one hyperlayer 44 including the second neural network.

In a fourth step of the method, the up-sampled feature map data 30 are fed as parameter 40 to a neural network 46 of the hyperlayer 44. The neural network 46 may be a standard neural network. In the embodiment shown in Fig. 1, the neural network 46 is a multilayer perceptron 48.

The up-sampled feature map data 30 of a selected pixel 12 are fed to the neural network 46 or the multilayer perceptron 48 as parameter 40.

Furthermore, a set of periodic basis functions 50 is fed to the neural network 46 or the multilayer perceptron 48 as input 38. The set of periodic basis functions 50 represents the pixel coordinates 16 of the selected pixel 12.

In other words, a basis expansion of the pixel coordinates 16 of the selected pixel 12 is computed as input 38 for the hypernetwork 36. Examples of periodic basis functions are sine and cosine functions at different frequencies. A base harmonic of the set of periodic basis functions 50 should match the resolution of the up-sampled feature map data 30.
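Such a basis expansion can be sketched in Python as follows, assuming sine and cosine harmonics of the form used in the pseudo-code 86, i.e. sin(k * pi * i / upsample_rate) and cos(k * pi * j / upsample_rate); the number of harmonics is an illustrative choice:

```python
import math

# Illustrative sketch of a basis expansion of a pixel coordinate into a
# set of periodic (sine/cosine) basis functions at increasing harmonics.
def positional_encoding(i, j, rate, n_harmonics=3):
    enc = []
    for k in range(1, n_harmonics + 1):
        enc.append(math.sin(k * math.pi * i / rate))
        enc.append(math.cos(k * math.pi * j / rate))
    return enc

enc = positional_encoding(i=1, j=1, rate=4)
# six values: sine/cosine pairs at harmonics 1, 2, and 3
```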

In a fifth step of the method, the output 42 of the neural network 46 or multilayer perceptron 48 is fed to a spatial convolutional neural network 52. The hyperlayer 44 thus includes the neural network 46 or multilayer perceptron 48 and the convolutional neural network 52.

The fourth and the fifth step of the method are performed separately for each of the plurality of pixels 12.

In a sixth step of the method, further hyperlayers 44 of the hypernetwork 36, which are connected to each other, may be generated. Each hyperlayer 44 takes as input 38 the output 42 of the preceding hyperlayer 44.

In a seventh step of the method, the output 42 of the hypernetwork 36 is fed to a normalization function 54 such as a softmax function. This generates a segmentation or output mask 56. Thus, the hypernetwork 36 can be regarded as an implicit shape model 58.
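The normalization function 54 can be sketched as a softmax over the per-class outputs of the hypernetwork 36 for one pixel 12, producing the class probabilities of the segmentation mask 56; the logit values are illustrative:

```python
import math

# Sketch of a softmax normalization over per-class logits for one pixel.
def softmax(logits):
    m = max(logits)                      # subtract the max for stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# the probabilities sum to 1; the largest logit gets the largest probability
```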

Reference is now made to Fig. 2 which shows a computer-implemented method for up-sampling the feature map data 28.

The up-sampling of the feature map data 28 is performed by using at least one kernel 60. In the embodiment shown in Fig. 2, separate kernels 60a, 60b for each dimension are used. Furthermore, the kernels 60a, 60b are periodic. The feature map data 28 are up-sampled with a sum of the separate kernels 60a, 60b.
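This up-sampling with a sum of separate periodic kernels can be sketched in Python as follows, mirroring the indexing of the pseudo-code 86 (periodic "%" indexing into the kernels, integer "//" indexing into the coarse feature map); scalar features and kernel values are illustrative:

```python
# Sketch of the periodic up-sampling of Fig. 2: two separate 1-D kernels
# (one per spatial dimension) are indexed periodically, summed, and
# multiplied with the feature value at the corresponding coarse position.
def upsample_periodic(features, kernel_y, kernel_x, rate):
    h, w = len(features) * rate, len(features[0]) * rate
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            out[i][j] = ((kernel_y[i % rate] + kernel_x[j % rate])
                         * features[i // rate][j // rate])
    return out

features = [[1.0, 2.0]]                 # 1 x 2 coarse map of scalar features
up = upsample_periodic(features, kernel_y=[0.5, 1.0],
                       kernel_x=[0.0, 0.25], rate=2)
# up is 2 x 4; e.g. up[0][0] = (0.5 + 0.0) * 1.0 = 0.5
```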

Fig. 3 shows a computer-implemented method for training.

The initial convolutional neural network 26 and the hypernetwork 36 are trained simultaneously. During training, a random phase offset 62 is added to the kernels 60a, 60b. For example, a rollover can be used. The random phase offset 62 allows the neural network 46 to learn some equivariance to translation, while each element of the kernels 60a, 60b is enabled to specialize in maintaining consistency with adjacent elements of the kernels 60a, 60b.

The random phase offset 62 is applied to the up-sampling parameters, i.e., the up-sampling kernel 60, and/or to the positional encodings, i.e., the set of periodic basis functions 50 or the basis expansion of the pixel coordinates 16. After the training, the random phase offset 62 is corrected before processing.
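A rollover as a random phase offset can be sketched as a cyclic shift of the kernel by a random number of elements, which is undone after training; the helper names and the kernel values are illustrative:

```python
import random

# Sketch of a "rollover" random phase offset: the periodic kernel is
# cyclically shifted by a random number of elements during training.
def random_roll(kernel, rng):
    shift = rng.randrange(len(kernel))
    return kernel[-shift:] + kernel[:-shift] if shift else list(kernel), shift

def undo_roll(kernel, shift):
    # After training, the offset is corrected (undone) before processing.
    return kernel[shift:] + kernel[:shift]

rng = random.Random(0)
rolled, shift = random_roll([1.0, 2.0, 3.0, 4.0], rng)
restored = undo_roll(rolled, shift)     # equals the original kernel
```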

Fig. 4 shows a computer-implemented method for generating a plurality of hypernetworks 36 and hierarchically structuring the plurality of hypernetworks 36.

For generating the plurality of hypernetworks 36 or the plurality of implicit shape models 58, a first hypernetwork 64 is generated with the preceding method. A first set of periodic basis functions 66 is fed to the neural network 46 of the first hypernetwork 64.

Furthermore, the feature map data 28 of the first hypernetwork 64 are pooled for a first time to generate once pooled feature map data 68. In this context, pooling means reducing the information 20 contained in the feature map data 28. For example, the information 20 can be reduced by performing a max pooling.
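Max pooling can be sketched as follows: each 2 x 2 window of the feature map is reduced to its maximum, halving both spatial dimensions; the window size and the sample values are illustrative:

```python
# Sketch of 2 x 2 max pooling over a single-channel feature map:
# each 2 x 2 window is reduced to its maximum value.
def max_pool_2x2(fmap):
    h, w = len(fmap) // 2, len(fmap[0]) // 2
    return [[max(fmap[2 * i][2 * j], fmap[2 * i][2 * j + 1],
                 fmap[2 * i + 1][2 * j], fmap[2 * i + 1][2 * j + 1])
             for j in range(w)] for i in range(h)]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]
pooled = max_pool_2x2(fmap)   # [[4, 2], [2, 8]]
```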

The once pooled feature map data 68 are then used to generate a second hypernetwork 70. A second set of periodic basis functions 72 is fed to the neural network 46 of the second hypernetwork 70. The second set of periodic basis functions 72 has a base harmonic with a frequency lower than that of the first set of periodic basis functions 66.

Furthermore, the once pooled feature map data 68 are pooled for a second time to generate twice pooled feature map data 74. The twice pooled feature map data 74 are then used to generate a third hypernetwork 76. A third set of periodic basis functions 78 is fed to the neural network 46 of the third hypernetwork 76. The third set of periodic basis functions 78 has a base harmonic with a frequency lower than that of the second set of periodic basis functions 72.

The pooling of the feature map data 28 and the generating of the hypernetworks 36 can be repeated an arbitrary number of times. In this way, the hypernetworks 36 render at different distance scales. The generated hypernetworks 64, 70, 76 are connected to each other. In other words, the output 42 of the third hypernetwork 76 is up-sampled and fed to the second hypernetwork 70. The output 42 of the second hypernetwork 70 is up-sampled and fed to the first hypernetwork 64. This results in a hierarchical structure.

For example, the output 42 of the hypernetwork 36 at coarser distance scales could be treated as an input 38 for the hypernetwork 36 at finer distance scales, in addition to the pixel coordinates 16 of the selected pixel 12. In this way, consistency over large distance scales may be handled.
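The coarse-to-fine connection can be sketched as follows; nearest-neighbour repetition is used here as a simple illustrative stand-in for the up-sampling of the coarser output:

```python
# Sketch of the hierarchical connection: the per-pixel output of a
# coarser hypernetwork is up-sampled (here by nearest-neighbour
# repetition) and used as additional input at the finer scale.
def nearest_upsample(coarse, rate):
    return [[coarse[i // rate][j // rate]
             for j in range(len(coarse[0]) * rate)]
            for i in range(len(coarse) * rate)]

coarse_out = [[0.2, 0.8]]               # 1 x 2 coarse-scale output
fine_input = nearest_upsample(coarse_out, rate=2)
# 2 x 4 grid in which each coarse value is repeated in a 2 x 2 block
```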

The invention further provides a data processing device 80 (not shown) comprising means for carrying out the method for semantic segmentation of the image.

The invention further provides a computer program 82 comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method.

The computer program 82 may be written in any suitable programming language and may take a form which is expressed in the following text as a pseudo-code 86:

// Pseudo-code explanations:
// H, W: height and width of the input image
// n_class: number of output classes of the semantic segmentation image
// n_feat: dimension of the feature map
// os: "overall stride" of the feature map relative to the input image
// "%": remainder after division
// "//": comment at the beginning of an explanation, otherwise integer division
// indentation: indication of a block
// n_pos_embedding: number of elements inside the positional embedding

func execute_pixelwise_hypernetwork(
    image_network: ResNet50,
    upsample_kernel: (Array[4, n_feat, 2], Array[16, n_feat, 2]),
    implicit_network_conv_layers: [Conv2D],
    input_image: Array[H, W, 3],
    is_training: Boolean
) -> Array[H, W, n_class]:
    features_raw: Array[H / os, W / os, n_feat] = image_network.forward(input_image.pixels)

    // The number of layers is arbitrary, we use two for simplicity
    for i_layer = 1..2:
        if i_layer == 1:
            upsample_rate = 4
        else:
            upsample_rate = 16
        overall_stride = 16 / upsample_rate
        upsampled_features: Array[H // overall_stride, W // overall_stride, n_feat]
        if is_training:
            // the interpretation is that each element of upsample_kernel specialises
            // in maintaining consistency with adjacent elements of upsample_kernel
            upsample_kernel[i_layer, :, 0] = random_roll(upsample_kernel[i_layer, :, 0])
            upsample_kernel[i_layer, :, 1] = random_roll(upsample_kernel[i_layer, :, 1])
        upsampled_features[i, j, :] = (
            (upsample_kernel[i_layer][i % upsample_rate, :, 0]
                + upsample_kernel[i_layer][j % upsample_rate, :, 1])
            * features_raw[i // upsample_rate, j // upsample_rate, :])
        image_coords: Array[H // overall_stride, W // overall_stride, n_pos_embedding]
        image_coords[i, j, :] = [
            sin(pi * i / upsample_rate), cos(pi * j / upsample_rate),
            sin(2 * pi * i / upsample_rate), cos(2 * pi * j / upsample_rate),
            sin(3 * pi * i / upsample_rate), ...]
        if i_layer == 1:
            state = image_coords
        else:
            state = concat(upsample(state, upsample_factor=4), image_coords)
        for (i, j) = "each pixel coordinate":
            state[i, j, :] = MLP(inputs = state[i, j], params = upsampled_features[i, j])
        state = implicit_network_conv_layers[i_layer].forward(state)
    return softmax(state)

func train_network(
    images: Sequence[Array[H, W, 3]],
    labels: Sequence[Array[H, W, n_class]]):
    // Note "ResNet50" includes the global pooling layers but not the final
    // softmax layer. Standard initializations like Xavier initialization are used.
    // feature_dim may be 4096
    image_network = ResNet50()
    upsample_kernel = [UpsampleParameter(), UpsampleParameter()]
    implicit_network_conv_layers = [SmallConv2DNetwork(), SmallConv2DNetwork()]

    // n_train_iter may be set to 10^6
    for i_train_iter = 0..n_train_iter:
        image, label = random_sample(images, labels)
        predictions = execute_pixelwise_hypernetwork(
            image_network, upsample_kernel, implicit_network_conv_layers,
            image, is_training = True,
        )
        loss = cross_entropy_loss(predictions, label)
        // The ADAM learning rate may be set to 10^-4, other ADAM parameters
        // may be set to default values
        adam_optim([image_network, upsample_kernel, implicit_network_conv_layers], loss)

In the pseudo-code 86, “image_network” indicates the initial convolutional neural network 26, which is, in this case, based on “ResNet50()”. “ResNet50()” may be built up by a plurality of layers. An example of layers of “ResNet50()” is shown in the following Table 1:

Table 1 : Layer structure of the initial convolutional neural network 26.

In the pseudo-code 86, “features_raw” indicates the feature map data 28.

Furthermore, the counter “i_layer” is counted from 1 to 2. This means that in the present case, the hypernetwork 36 includes two hyperlayers 44, the hyperlayer “1” and the hyperlayer “2”.

In the pseudo-code 86, “upsample_kernel” indicates the kernels 60a, 60b, which are, in this case, “upsample_kernel[i_layer, :, 0]” and “upsample_kernel[i_layer, :, 1]” for the separate dimensions “0” and “1”.

An example of how “upsample_kernel[1, :, 0]” and “upsample_kernel[1, :, 1]” for hyperlayer “1” are built up is shown in the following Table 2:

Table 2: Layer structure of the kernels 60a, 60b for hyperlayer “1”.

An example of how “upsample_kernel[2, :, 0]” and “upsample_kernel[2, :, 1]” for hyperlayer “2” are built up is shown in the following Table 3:

Table 3: Layer structure of the kernels 60a, 60b for hyperlayer “2”.

In the present case, the up-sampling is performed for each hyperlayer 44 and with different kernels 60a, 60b per hyperlayer 44. This means, the up-sampling is performed before feeding the feature map data 28 to the neural network 46.

“upsampled_features” indicates the up-sampled feature map data 30. The up-sampled feature map data 30 are then fed to the neural network 46. Thus, in the code implementation, the structured up-sampling layers and the MLP layers may be combined to save memory space.

In the pseudo-code 86, “MLP” indicates the neural network 46, in this case the multilayer perceptron 48. Since the kernels 60a, 60b are different for each hyperlayer 44, the inputs of the neural networks 46 or multilayer perceptrons 48 differ for each hyperlayer 44. “MLP” of the hyperlayer “1” may be built up by a plurality of layers. An example of layers of the “MLP” for hyperlayer “1” is shown in the following Table 4. The total number of parameters of the “MLP” for hyperlayer “1” is 2128.

Table 4: Layer structure of neural network 46 or multilayer perceptron 48 for hyperlayer “1”.

An example of layers of the “MLP” for hyperlayer “2” is shown in the following Table 5. The total number of parameters of the “MLP” for hyperlayer “2” is 3152.

Table 5: Layer structure of the neural network 46 or the multilayer perceptron 48 for hyperlayer “2”.

In the pseudo-code 86, “implicit_network_conv_layers” indicates the convolutional neural network 52. “implicit_network_conv_layers” may be built up by a plurality of layers. An example of layers of “implicit_network_conv_layers” for hyperlayer “1” is shown in the following Table 6:

An example of layers of “implicit_network_conv_layers” for hyperlayer “2” is shown in the following Table 7:

Here, since the outputs of the multilayer perceptron 48 of hyperlayer “1” and hyperlayer “2” have similar dimensions, the convolutional neural networks 52 of hyperlayer “1” and hyperlayer “2” are similar.

The invention further provides a computer-readable data carrier 84 (not shown) having stored thereon the computer program 82.

REFERENCE SIGNS

10 image

12 pixel

14 position

16 pixel coordinate

18 image or input data

20 information

22 color

24 array

26 initial convolutional neural network

28 feature map data

30 up-sampled feature map data

32 two-dimensional convolutional neural network layer

34 residual neural network layer

36 hypernetwork

38 input

40 parameter

42 output

44 hyperlayer

46 neural network

48 multilayer perceptron

50 set of periodic basis functions

52 convolutional neural network

54 normalization function

56 segmentation or output mask

58 implicit shape model

60 kernel

60a, 60b kernel

62 random phase offset

64 first hypernetwork

66 first set of periodic basis functions

68 once pooled feature map data

70 second hypernetwork

72 second set of periodic basis functions

74 twice pooled feature map data

76 third hypernetwork

78 third set of periodic basis functions

80 data processing device

82 computer program

84 computer-readable data carrier

86 pseudo-code

H height

W width