Title:
DEVICE AND METHOD FOR GENERATING A GROUP EQUIVARIANT CONVOLUTIONAL NEURAL NETWORK
Document Type and Number:
WIPO Patent Application WO/2017/142397
Kind Code:
A1
Abstract:
The invention provides a device for generating a convolutional neural network, said device comprising a computer device comprising a computer program which, when running on said computer device, defines said convolutional neural network, said convolutional neural network comprising at least one non-translational group convolution layer, said non-translational group convolution layer taking as input a set of learnable filters and an input stack of feature maps, both inputs being a vector-valued function on a discrete set of sampling points having a set of symmetry transformations that form a mathematical group of symmetry transformations, with the exclusion of mathematical groups of symmetry transformations that are pure translation groups, and said non-translational group convolution layer producing as output an output stack of feature maps being a vector-valued function on said mathematical group of symmetry transformations.

Inventors:
COHEN TACO SEBASTIAAN (NL)
WELLING MAX (NL)
Application Number:
PCT/NL2017/050084
Publication Date:
August 24, 2017
Filing Date:
February 10, 2017
Assignee:
SCYFER B V (NL)
International Classes:
G06F17/15; G06N3/02; G06N3/10
Domestic Patent References:
WO2014035738A12014-03-06
Foreign References:
EP2833295A22015-02-04
EP2869239A22015-05-06
EP2911111A22015-08-26
Other References:
ROBERT GENS ET AL: "Deep Symmetry Networks", 1 January 2014 (2014-01-01), XP055252284, Retrieved from the Internet [retrieved on 20160222]
SANDER DIELEMAN ET AL: "Exploiting Cyclic Symmetry in Convolutional Neural Networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 8 February 2016 (2016-02-08), XP080682215
TYLER HIGHLANDER ET AL: "Very Efficient Training of Convolutional Neural Networks using Fast Fourier Transform and Overlap-and-Add", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 January 2016 (2016-01-25), XP080680179
CHIYUAN ZHANG ET AL: "Discriminative Template Learning in Group-Convolutional Networks for Invariant Speech Representations", INTERSPEECH 2015 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION DRESDEN, GERMANY, 6 September 2015 (2015-09-06), pages 3229 - 3233, XP055328747, Retrieved from the Internet [retrieved on 20161213]
OREN RIPPEL ET AL: "Spectral Representations for Convolutional Neural Networks", CORR (ARXIV), vol. abs/1506.03767v1, 11 June 2015 (2015-06-11), pages 1 - 10, XP055294145
TACO S COHEN ET AL: "Group Equivariant Convolutional Networks", 24 February 2016 (2016-02-24), pages 1 - 12, XP055328734, Retrieved from the Internet [retrieved on 20161213]
COXETER, H. S. M.; MOSER, W. O. J.: "Generators and Relations for Discrete Groups", 1980, SPRINGER BERLIN HEIDELBERG
SCHATTSCHNEIDER, D.: "The Plane Symmetry Groups: Their Recognition and Notation", AMERICAN MATHEMATICAL MONTHLY, vol. 85, 1978, pages: 439 - 450
HAHN, T: "International Tables for Crystallography", 2002, SPRINGER NETHERLANDS
IOFFE, S.; SZEGEDY, C.: "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON MACHINE LEARNING (ICML), 2015
CHIRIKJIAN, G. S.; KYATKIN, A. B.: "Engineering Applications of Noncommutative Harmonic Analysis", 2001, CRC PRESS
"Proceedings of the 1995 DIMACS Workshop on Groups and Computation", vol. 28, 1995, pages: 329 - 370
KYATKIN, A. B.; CHIRIKJIAN, G. S.: "Pattern Matching as a Correlation on the Discrete Motion Group", COMPUTER VISION AND IMAGE UNDERSTANDING, vol. 74, no. 1, 1999, pages 22 - 35
MASLEN, D.; ROCKMORE, D: "Proceedings of the DIMACS Workshop on Groups and Computation", 1995, article "Generalized FFTs -- a survey of some recent results", pages: 183 - 237
"Recent Progress and Applications in Group FFTS", NATO SCIENCE SERIES II: MATHEMATICS, PHYSICS AND CHEMISTRY, vol. 136, pages 227 - 254
LAROCHELLE, HUGO; ERHAN, D.; COURVILLE, AARON; BERGSTRA, JAMES; BENGIO, YOSHUA: "An empirical evaluation of deep architectures on problems with many factors of variation", PROCEEDINGS OF THE 24TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING (ICML'07), 2007
SCHMIDT, UWE; ROTH, STEFAN: "Learning rotation-aware features: From invariant priors to equivariant descriptors", PROCEEDINGS OF THE IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2012
Attorney, Agent or Firm:
VAN ESSEN, Peter Augustinus (NL)
Claims:
Claims

1. A device for generating a convolutional neural network, said device comprising a computer device comprising a computer program which, when running on said computer device, defines said convolutional neural network, said convolutional neural network comprising at least one non-translational group convolution layer, said non-translational group convolution layer taking as input a set of learnable filters and an input stack of feature maps, both inputs being a vector-valued function on a discrete set of sampling points having a set of symmetry transformations that form a mathematical group of symmetry transformations, with the exclusion of mathematical groups of symmetry transformations that are pure translation groups, and said non-translational group convolution layer producing as output an output stack of feature maps, said output stack of feature maps being a vector-valued function on said mathematical group of symmetry transformations, said non-translational group convolution layer comprising inner products between said input stack of feature maps and each transformed filter from a set of transformed filters, said set of transformed filters comprising learnable filters from said set of learnable filters to which a transformation from said group of symmetry transformations has been applied, and wherein each said inner product produces a result to be stored at a position in said output stack of feature maps, and wherein the result of each inner product is stored in said output stack of feature maps at the position associated with the learnable filter and the symmetry transformation that was applied to the learnable filter to obtain the transformed filter that was used in said inner product.

2. The device of claim 1, wherein said group of symmetry transformations is a symmorphic group, wherein each symmetry transformation in said symmorphic group is the composition of one symmetry transformation from the stabilizer of the origin and one translation, wherein the stabilizer of the origin is the subgroup of said group of symmetry transformations that contains all symmetry transformations that leave invariant the origin of said discrete set of sampling points, wherein the origin is a fixed but arbitrarily designated point from said discrete set of sampling points, and wherein said non-translational group convolution layer is implemented by first applying every transformation from the stabilizer of the origin to each filter in order to obtain a set of stabilizer transformed filters, and then applying a known translational convolution algorithm to said set of stabilizer transformed filters and said input stack of feature maps.

3. The device of claim 1 or 2, wherein said non-translational group convolution layer is implemented using the Fast Group Fourier Transform (FGFT) and its inverse, the Inverse Fast Group Fourier Transform (IFGFT), further comprising:

- applying the FGFT for said non-translational group of symmetry transformations to both said input stack of feature maps and said set of learnable filters, thus obtaining for said input stack of feature maps and each said learnable filter a matrix-valued function, and

- computing the matrix product of said Fourier transform of said input stack of feature maps with each of said Fourier transforms of said set of learnable filters, thus obtaining the Fourier transform of the group convolution of said input stack of feature maps with said learnable filters, and finally computing the Inverse Fast Group Fourier Transform (IFGFT) for said non-translational group of symmetry transformations.

4. The device of any one of the preceding claims, wherein a learned bias term is added to one or more of said output stack of feature maps, wherein in particular only a single bias term per feature map is added in the output stack of feature maps.

5. The device of any one of the preceding claims, wherein said discrete set of sampling points is the set of integers Z, and wherein said group of symmetry transformations is the group which consists of all compositions of integer translations and reflections about any point in the discrete set of sampling points, in particular the group consisting of all compositions of integer addition and multiplication by -1.

6. The device of any one of the preceding claims, wherein said discrete set of sampling points is the set of sampling points in a square lattice, and wherein said group of symmetry transformations is the group p4 which consists of all compositions of integer translations in two orthogonal directions and rotations by multiples of 90 degrees about any point in the discrete set of sampling points.

7. The device of any one of the preceding claims, wherein said discrete set of sampling points is the set of points in a square lattice, and wherein said group of symmetry transformations is the group p4m which consists of all compositions of integer translations in two orthogonal directions, rotations by multiples of 90 degrees about any point in the discrete set of sampling points, and reflections in the mirror lines of said square lattice.

8. The device of any one of the preceding claims, wherein said discrete set of sampling points is the set of points in a hexagonal lattice, and wherein said symmetry group is the group p6 which consists of compositions of integer translations in the two independent directions of said hexagonal lattice and rotations by multiples of 60 degrees about any point in said hexagonal lattice.

9. The device of any one of the preceding claims, wherein said discrete set of sampling points is the set of points in a hexagonal lattice, and wherein said group of symmetry transformations is the group p6m which consists of compositions of integer translations in the two independent directions of said hexagonal lattice and rotations by multiples of 60 degrees about any point in said hexagonal lattice and reflections in mirrors that are aligned with the translation axes of said hexagonal lattice.

10. The device of any one of the preceding claims, wherein said discrete set of sampling points is the set of points in a rhombic lattice, or a square lattice, or a hexagonal lattice, or a rectangular lattice, or a parallelogrammic lattice, and said group of symmetry transformations is, respectively, the group cmm, p4m, p6m, pmm, or p2, or any of the subgroups of these groups that are not pure translation groups, in particular each of the 17 wallpaper groups except the group p1 of pure translations.

11. The device of any one of the preceding claims, wherein said discrete set of sampling points is the set of points in a three-dimensional cubic lattice, and wherein said group of symmetry transformations is the group which consists of all compositions of integer translations in the three orthogonal directions of said lattice and rotations by multiples of 90 degrees about any axis attached to any point in said lattice.

12. The device of any one of the preceding claims, wherein said discrete set of sampling points is the set of points in a three-dimensional cubic lattice, and wherein said group of symmetry transformations is the group which consists of all compositions of integer translations in the three orthogonal directions of said lattice, rotations by multiples of 90 degrees about any axis attached to any point in said lattice, and reflections in any axis-aligned mirror.

13. The device of any one of the preceding claims, wherein said discrete set of sampling points is a three-dimensional orthorhombic or tetragonal lattice, and wherein said input stack of feature maps comprises voxels in a digital magnetic resonance image.

14. The device of any one of the preceding claims, wherein said discrete set of sampling points is a three-dimensional Bravais lattice, and wherein said group of symmetry transformations is the group which consists of all symmetries of said lattice, or any of the subgroups of these groups that are not pure translation groups.

15. The device of any one of the preceding claims, wherein said discrete set of sampling points consists of the elements in any of the 17 wallpaper groups except the pure translation group, or any of the 230 space groups except the pure translation groups, or any of the two-dimensional point groups or any of the three-dimensional point groups.

16. The device of any one of the preceding claims, wherein said group of symmetry transformations is one of the 6 Frieze groups p11g, p1m1, p2, p2mg, p11m, p2mm, or any of their subgroups that are not pure translation groups.

17. The device of any one of the preceding claims, wherein said discrete set of sampling points consists of the elements of a square, rectangular, rhombic, parallelogrammic or hexagonal lattice, and wherein said input stack of feature maps comprises pixels of a digital image having one or more color channels.

18. The device of any one of the preceding claims, wherein G-pooling is applied to at least one of the feature maps from said output stack of feature maps, in particular said pooling is selected from any one of max-pooling, sum pooling, square pooling and a combination thereof.

19. The device of any one of the preceding claims, wherein said data point is a spatio- temporal signal, in particular said spatio-temporal signal is selected from any one of audio spectrograms, video, and fMRI sequences, and wherein more in particular said discrete set of sampling points comprises points at different points in time which are included in the group convolution.

20. An image classification device comprising the device of any one of the preceding claims for digital images, wherein said symmetry group is p4m, and wherein said input stack of feature maps of a first non-translational group convolution layer is a digital image, and subsequent non-translational group convolution layers have p4m symmetry, and between each non-translational group convolution layer is a layer of non-linearities.

21. A classification method comprising applying the device of any one of the preceding claims to a data item to obtain a classification, wherein said data item is an input stack of feature maps.

22. A data segmentation method comprising applying the device of any one of the preceding claims including claims 2 and 18, wherein the output of said convolutional neural network is obtained by G-pooling over said stabilizer of the origin, thus obtaining a segmentation map that transforms in the same way as the input stack of feature maps to the first layer of the network under transformations from said group of symmetry transformations.

23. A regression method comprising applying the device of any of the preceding claims to a data item to obtain a numerical estimate of a quantity associated with the data item.

24. A data analysis method comprising the device of any one of the preceding claims, wherein the output of the convolutional neural network is a feature vector that is used for further computational analysis of input of the convolutional neural network.

-o-o-o-o-o-

Description:
Device and method for generating a group equivariant convolutional neural network

Field of the invention

The invention relates to a device for generating a convolutional neural network, a computer program product for generating a convolutional neural network, and a series of devices for solving machine learning problems including but not limited to classification, regression, segmentation, detection, labelling, and prediction problems.

Background of the invention

In the art, many different types of neural networks are known. These networks consist of a sequence of data processing steps called layers, each of which takes as input one or more outputs from previous layers and/or one or more data points that represent the input to the entire network. Typically, some of the layers employ parameters that are learned from data by a learning algorithm. Neural networks have been proven valuable for modelling data that cannot easily be described by simple mathematical models. Neural networks have a potentially large number of adjustable parameters that can be adapted by a learning algorithm so as to optimize the performance of the network on some task, such as object recognition, speech recognition, or computer-aided diagnosis based on medical scans. Learning is typically performed by gradient-based optimization algorithms, where the gradient is computed using the backpropagation algorithm.

Deep convolutional neural networks (CNNs, convnets) in particular have proven to be very powerful models for sensory data such as images, video, and audio. Although a strong theory of neural network design is currently lacking, a large amount of empirical evidence supports the notion that both convolutional parameter sharing and network depth (among other factors) are important for good predictive performance.

Over the last few years, convolutional neural networks have achieved significant breakthroughs. Examples of convolutional neural networks and their applications are for instance described in EP2833295.

EP2833295 according to its abstract describes a convolutional-neural-network- based classifier, a classifying method by using a convolutional-neural-network-based classifier and a method for training the convolutional-neural-network-based classifier. The convolutional-neural-network-based classifier comprises: a plurality of feature map layers, at least one feature map in at least one of the plurality of feature map layers being divided into a plurality of regions; and a plurality of convolutional templates corresponding to the plurality of regions respectively, each of the convolutional templates being used for obtaining a response value of a neuron in the corresponding region.

EP2869239 according to its abstract describes systems, methods, and non-transitory computer readable media that can align face images, classify face images, and verify face images by employing a deep neural network (DNN). A 3D-aligned face image can be generated from a 2D face image. An identity of the 2D face image can be classified based on provision of the 3D-aligned face image to the DNN. The identity of the 2D face image can comprise a feature vector.

EP2911111 according to its abstract describes an apparatus and a method for lesion detection. The method of lesion detection involves detecting lesion candidates from a medical image, detecting anatomical objects from the medical image, verifying each of the lesion candidates based on anatomical context information comprising information regarding a location relationship between the lesion candidates and the anatomical objects, and removing one or more false positive lesion candidates from the detected lesion candidates based on a verification result.

Robert Gens ET AL: "Deep Symmetry Networks", of 1 January 2014, discloses according to its abstract "The chief difficulty in object recognition is that objects' classes are obscured by a large number of extraneous sources of variability, such as pose and part deformation. These sources of variation can be represented by symmetry groups, sets of composable transformations that preserve object identity. Convolutional neural networks (convnets) achieve a degree of translational invariance by computing feature maps over the translation group, but cannot handle other groups. As a result, these groups' effects have to be approximated by small translations, which often requires augmenting datasets and leads to high sample complexity. In this paper, we introduce deep symmetry networks (symnets), a generalization of convnets that forms feature maps over arbitrary symmetry groups. Symnets use kernel-based interpolation to tractably tie parameters and pool over symmetry spaces of any dimension. Like convnets, they are trained with backpropagation. The composition of feature transformations through the layers of a symnet provides a new approach to deep learning. Experiments on NORB and MNIST-rot show that symnets over the affine group greatly reduce sample complexity relative to convnets by better capturing the symmetries in the data."

SANDER DIELEMAN ET AL: "Exploiting Cyclic Symmetry in Convolutional Neural Networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 8 February 2016, discloses according to its abstract "Many classes of images exhibit rotational symmetry. Convolutional neural networks are sometimes trained using data augmentation to exploit this, but they are still required to learn the rotation equivariance properties from the data. Encoding these properties into the network architecture, as we are already used to doing for translation equivariance by using convolutional layers, could result in a more efficient use of the parameter budget by relieving the model from learning them. We introduce four operations which can be inserted into neural network models as layers, and which can be combined to make these models partially equivariant to rotations. They also enable parameter sharing across different orientations. We evaluate the effect of these architectural modifications on three datasets which exhibit rotational symmetry and demonstrate improved performance with smaller models."

TYLER HIGHLANDER ET AL: "Very Efficient Training of Convolutional Neural Networks using Fast Fourier Transform and Overlap-and-Add", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 January 2016, discloses according to its abstract "Convolutional neural networks (CNNs) are currently state-of-the-art for various classification tasks, but are computationally expensive. Propagating through the convolutional layers is very slow, as each kernel in each layer must sequentially calculate many dot products for a single forward and backward propagation, which equates to $O(N^2 n^2)$ per kernel per layer where the inputs are N x N arrays and the kernels are n x n arrays. Convolution can be efficiently performed as a Hadamard product in the frequency domain. The bottleneck is the transformation, which has a cost of $O(N^2 \log_2 N)$ using the fast Fourier transform (FFT). However, the increase in efficiency is less significant when $N \gg n$, as is the case in CNNs. We mitigate this by using the "overlap-and-add" technique, reducing the computational complexity to $O(N^2 \log_2 n)$ per kernel. This method increases the algorithm's efficiency in both the forward and backward propagation, reducing the training and testing time for CNNs. Our empirical results show our method reduces computational time by a factor of up to 16.3 times the traditional convolution implementation for an 8 x 8 kernel and a 224 x 224 image."

Chiyuan Zhang ET AL: "Discriminative Template Learning in Group-Convolutional Networks for Invariant Speech Representations", INTERSPEECH 2015 16th Annual Conference of the International Speech Communication Association Dresden, Germany, 6 September 2015 (2015-09-06), pages 3229-3233, discloses according to its abstract "In the framework of a theory for invariant sensory signal representations, a signature which is invariant and selective for speech sounds can be obtained through projections in template signals and pooling over their transformations under a group. For locally compact groups, e.g., translations, the theory explains the resilience of convolutional neural networks with filter weight sharing and max pooling across their local translations in frequency or time. In this paper we propose a discriminative approach for learning an optimum set of templates, under a family of transformations, namely frequency transpositions and perturbations of the vocal tract length, which are among the primary sources of speech variability. Implicitly, we generalize convolutional networks to transformations other than translations, and derive data-specific templates by training a deep network with convolution-pooling layers and densely connected layers. We demonstrate that such a representation, combining group-generalized convolutions, theoretical invariance guarantees and discriminative template selection, improves frame classification performance over standard translation-CNNs and DNNs on TIMIT and Wall Street Journal datasets."

Summary of the invention

The invention seeks to provide a collection of convolutional neural network layers that enjoy greater statistical efficiency for problem domains that have symmetry, which means that the network requires fewer training examples to learn to perform its task at a certain level of proficiency in such symmetric problem domains.

To that end, the invention relates to a device for generating a convolutional neural network, said device comprising a computer device comprising a computer program which, when running on said computer device, defines said convolutional neural network, said convolutional neural network comprising at least one non- translational group convolution layer, said non-translational group convolution layer taking as input a set of learnable filters and an input stack of feature maps, both inputs being a vector-valued function on a discrete set of sampling points having a set of symmetry transformations that form a mathematical group of symmetry transformations, with the exclusion of mathematical groups of symmetry transformations that are pure translation groups, and said non-translational group convolution layer producing as output an output stack of feature maps, said output stack of feature maps being a vector-valued function on said mathematical group of symmetry transformations, said non-translational group convolution layer comprising inner products between said input stack of feature maps and each transformed filter from a set of transformed filters, said set of transformed filters comprising learnable filters from said set of learnable filters to which a transformation from said group of symmetry transformations has been applied, and wherein each said inner product produces a result to be stored at a position in said output stack of feature maps, and wherein the result of each inner product is stored in said output stack of feature maps at the position associated with the learnable filter and the symmetry transformation that was applied to the learnable filter to obtain the transformed filter that was used in said inner product.

The invention provides a collection of convolutional neural network layers that enjoy greater statistical efficiency for problem domains that have symmetry, which means that the network requires fewer training examples to learn to perform its task at a certain level of proficiency in such symmetric problem domains. Symmetry in this context means that the correct output for the network does not change (or changes in a predictable, equivariant way) when certain transformations are applied to the input of the network. For example, in an image classification problem, the network has to predict the class of a visual object, which typically does not depend on the position and orientation of the object in the image. As another example, in an image segmentation problem, the output of the network is a segmentation mask, and this mask should undergo a rotation when the input image is rotated.

It was found that known convolutional parameter sharing is effective because there is a translation symmetry in most vision problems: the label function and the data distribution are both approximately invariant to shifts. By using the same parameters to analyze each part of the image, a convolution layer uses far fewer parameters than a fully connected layer, while preserving the capacity to learn many useful data transformations.

It was further found that convolution layers can be used effectively in a deep network because all the layers in such a network are translation equivariant: shifting the image and then feeding it through a number of convolution layers gives the same result as feeding the original image through the same layers and then shifting the result. In other words, the symmetry (translation) is preserved by each layer, which makes it possible to exploit it by parameter-sharing not just in the first but also in higher layers of the network.

It was further found that convolutional layers can be generalized to exploit larger groups of symmetries, including rotations, reflections and glide reflections, thus further increasing the degree of parameter sharing. This is the main invention described herein, and is referred to as a G-convolution or G-correlation layer, the resulting networks being called G-CNNs. Furthermore, other operations typically used in CNNs, such as pooling, bias, and batch normalization layers, are generalized so as to be compatible with (i.e. equivariant with respect to) G-convolution and G-correlation layers.

The word 'equivariance' is not commonly used in the deep learning literature, so we briefly discuss it here. Transforming an input x by a transformation g (forming $T_g x$) and then passing it through a network or layer φ should give the same result as first mapping x through the network and then transforming the representation:

$\phi(T_g x) = T'_g \phi(x)$

Layer or network φ is then called an equivariant map.

Invariance is a special kind of equivariance where $T'_g$ is the identity transformation for all g. In deep learning, general equivariance is more useful than invariance because it is impossible for a layer to determine if low level features are in the right spatial configuration if they are invariant. To make it easier to introduce the invention, we first describe a model of known CNNs as used in computer vision, and describe their translation equivariance. In this model, stacks of feature maps are modeled as functions $f : \mathbb{Z}^2 \to \mathbb{R}^K$ supported on a bounded domain supp(f). In this respect, both the input (e.g. an image) and the intermediate representations of the input as produced by the layers of the network are considered to be feature maps. We think of the points of $\mathbb{Z}^2$ as the coordinates of a (typically square) lattice of sampling points. At each sampling point, the stack of feature maps stores a real K-dimensional vector, where K is the number of color channels of the input image (e.g. K=1 for grayscale, K=3 for color images), or the number of feature maps in higher layers. Although the function f extends to infinity, we only have to store a finite array in computer memory because the function is nonzero only on a small domain supp(f). In this text, whenever we refer to a stack of feature maps or to a vector-valued function on a discrete set of sampling points, we are referring to a function that is zero everywhere outside of a finite support, and hence can be stored in a finite array in computer memory.

For transformations of the stack of feature maps, the following notation for a transformation g acting on a stack of feature maps f is introduced:

$[L_g f](x) = f(g^{-1} x)$

The same definition is valid for feature maps on a group G or discrete set of sampling points that will be introduced later. In these cases x is not a point in $\mathbb{Z}^2$, but a point in said group or discrete set, and the juxtaposition $g^{-1}x$ denotes the group operation (when x is a group element) or the group action (when x is a point in a discrete set of sampling points on which G acts). For feature maps on the groups p4 and p4m, this is visualized in figures 1 and 2.
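By way of illustration, the lookup $x \mapsto g^{-1}x$ can be realized directly as array indexing. The following minimal numpy sketch (the function name and coordinate conventions are illustrative choices, not taken from the patent) applies $L_g$ to a feature map of shape (K, H, W) for a translation and for a 90 degree rotation:

    import numpy as np

    def transform_feature_map(f, g_inverse):
        # Apply [L_g f](x) = f(g^{-1} x) to a feature map f of shape (K, H, W).
        # g_inverse maps an output coordinate x to the source coordinate g^{-1} x.
        # Coordinates outside the array are treated as zero (finite support).
        K, H, W = f.shape
        out = np.zeros_like(f)
        for u in range(H):
            for v in range(W):
                uu, vv = g_inverse(u, v)
                if 0 <= uu < H and 0 <= vv < W:
                    out[:, u, v] = f[:, uu, vv]
        return out

    f = np.random.randn(3, 8, 8)

    # Translation by t = (2, 3): g^{-1} x = x - t.
    shifted = transform_feature_map(f, lambda u, v: (u - 2, v - 3))

    # A 90 degree rotation of the sampling grid; this particular g_inverse
    # reproduces numpy's clockwise rotation of the array.
    rotated = transform_feature_map(f, lambda u, v: (8 - 1 - v, u))
    assert np.allclose(rotated, np.rot90(f, k=-1, axes=(1, 2)))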

A standard inner product on the space of $\mathbb{Z}^2$ stacks of feature maps is defined by:

$\langle f, \psi \rangle = \sum_{x \in \mathbb{Z}^2} \sum_{k=1}^{K} f_k(x)\,\psi_k(x)$

Again, the generalization to feature maps on other discrete sets and groups than $\mathbb{Z}^2$ is straightforward: simply replace the sum over $\mathbb{Z}^2$ by a sum over said discrete space or group. At each layer $l$ of a regular CNN, the CNN takes as input a stack of feature maps (or an image) and convolves or correlates it with a set of $K_{l+1}$ filters:

$[f \star \psi](x) = \sum_{y \in \mathbb{Z}^2} \sum_{k=1}^{K_l} f_k(y)\,\psi_k(y - x)$

If one employs the convolution ($\ast$) in the forward pass, the correlation ($\star$) will appear in the backward pass when computing gradients, and vice versa. Both operations are very similar, and, following standard practice in the literature on CNNs, we will often simply use the word "convolution" to refer to any convolution-type operation including the correlation operation. The invention described herein is not limited to any one particular correlation or convolution-type operation. Here we use the convention that the correlation is used in the forward pass.

Using the substitution y → y + t, and leaving out the summation over feature maps for clarity, we see that a translation t followed by a correlation is the same as a correlation followed by a translation:

$[[L_t f] \star \psi](x) = \sum_y f(y - t)\,\psi(y - x) = \sum_y f(y)\,\psi(y + t - x) = [f \star \psi](x - t) = [L_t [f \star \psi]](x)$

And so we say that "correlation is an equivariant map for the translation group", or that "correlation and translation commute".
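This commutation property is easy to verify numerically. The sketch below (an illustration only, using scipy's planar correlation with periodic boundary conditions so that the shift is exact on the torus; with zero padding the identity holds away from the border) checks translation equivariance:

    import numpy as np
    from scipy.signal import correlate2d

    rng = np.random.default_rng(0)
    f = rng.standard_normal((16, 16))
    psi = rng.standard_normal((3, 3))

    def corr(f, psi):
        # Planar correlation with periodic ("wrap") boundary.
        return correlate2d(f, psi, mode="same", boundary="wrap")

    shift = lambda x: np.roll(x, (2, 5), axis=(0, 1))  # translation t = (2, 5)

    # [L_t f] correlated with psi equals L_t applied to [f correlated with psi].
    assert np.allclose(corr(shift(f), psi), shift(corr(f, psi)))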

The known translational correlation operation defined before is computed by shifting a filter and then computing an inner product with the stack of input feature maps. By replacing the shift by a more general transformation from some group G, we get the G-correlation that can be applied to a function on $\mathbb{Z}^2$ (e.g. an image):

$[f \star \psi](g) = \sum_{y \in \mathbb{Z}^2} \sum_k f_k(y)\,\psi_k(g^{-1} y)$

Notice that both the input image f and the filter ψ are functions on the plane $\mathbb{Z}^2$, but the feature map $f \star \psi$ is a function on the discrete group G (which may contain translations as a subgroup). Hence, for all layers after the first, the filters must also be functions on G, and the correlation operation becomes

$[f \star \psi](g) = \sum_{h \in G} \sum_k f_k(h)\,\psi_k(g^{-1} h)$

The scope of the invention also includes G-convolution-like sums where a different group action than $g^{-1}h$ or $hg$ is used, and further includes G-convolution-like sums where the arguments f and ψ are swapped. All of these operations enjoy similar equivariance properties that are easily derived by direct computation (as shown below for the G-correlation).

The equivariance of both the first-layer G-correlation and the full G-correlation operation is derived using the substitution h → uh:

$[[L_u f] \star \psi](g) = \sum_h \sum_k f_k(u^{-1} h)\,\psi_k(g^{-1} h) = \sum_h \sum_k f_k(h)\,\psi_k(g^{-1} u h) = [f \star \psi](u^{-1} g) = [L_u [f \star \psi]](g)$

For the first-layer correlation, the inputs f and ψ are functions on $\mathbb{Z}^2$, and $L_u$ applied to f denotes the transformation of such a function, while $L_u$ applied to $f \star \psi$ denotes the transformation of the feature map (see figures 1 and 2 for an illustration), which is a function on G. For the full G-correlation, both the inputs f and ψ and the output $f \star \psi$ are functions on G.

More generally, we can define a G-convolution layer for vector-valued functions on any discrete set of sampling points which has a group of symmetries (i.e. a set of transformations that map the set of sampling points onto itself, where the set of transformations is closed (in the mathematical sense) under composition of transformations and inverses). Since for such a "symmetric sampling set" there is a well-defined action of the group on the sampling set, we can also transform a function defined on this sampling set using the operator $L_g$ defined before, which can then be used to define an equivariant G-correlation and G-convolution as shown before (replacing the sum over $\mathbb{Z}^2$ or G by a sum over said set of sampling points). The G-correlation and G-convolution of functions on $\mathbb{Z}^2$ and G are special cases of this, because indeed a group of symmetries of $\mathbb{Z}^2$ acts on $\mathbb{Z}^2$, and indeed G acts on G by left multiplication. Note further that for a given set of sampling points, there may be multiple groups of symmetries. For instance, the square grid $\mathbb{Z}^2$ has a group of symmetries that includes translations, rotations and reflections, but also a group that includes only translations, or only translations and rotations. Each of these leads to a different G-convolution or G-correlation layer.
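To make the inner-product picture concrete, here is a naive Python sketch of the first-layer p4-correlation (translations plus 90 degree rotations on a square grid); it is an illustration with illustrative function names, assuming square filters and zero padding outside the finite support:

    import numpy as np
    from scipy.signal import correlate2d

    def p4_corr_first_layer(f, psi):
        # f:   input stack of feature maps, shape (K, H, W)
        # psi: one learnable filter, shape (K, n, n)
        # Returns a p4 feature map of shape (4, H, W): one planar map per
        # rotation r in {e, r, r^2, r^3}.
        K, H, W = f.shape
        out = np.zeros((4, H, W))
        for r in range(4):
            # Transformed filter L_r psi: rotate every channel by r * 90 degrees.
            psi_r = np.rot90(psi, k=r, axes=(1, 2))
            # Inner products of f with all translates of the rotated filter,
            # i.e. a planar correlation, summed over the input channels k.
            out[r] = sum(correlate2d(f[k], psi_r[k], mode="same")
                         for k in range(K))
        return out

A full layer loops this over all learnable filters, so each filter contributes four output planes, one per rotation.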

There are many examples where this generic G-convolution or G-correlation is useful. There are many types of data that can have many kinds of symmetry. For example, one could have images sampled on a hexagonal lattice, and a learning problem which has a translation, rotation and reflection symmetry. In this case, one could use a G-CNN with G=p6m, the group of symmetries of a hexagonal lattice. Or one might have a 3D medical scan which has a different resolution along the z-axis than along the x and y-axes. Such a lattice does not have full 3D rotation symmetry, but it admits transformations such as vertical reflections and horizontal rotations, among other transformations.

In general, depending on the type of data and the symmetry of the problem, one can choose to use any finite group, any Frieze group (groups of symmetries with one independent translation) except the pure 1D translation Frieze group p1 for which the CNN is known, any wallpaper group (groups of symmetries with two independent translations) except the pure 2D translation wallpaper group p1 for which the CNN is known, or any space group (except pure translation groups, for which the CNN is known). The mentioned Frieze, wallpaper and space groups have all been classified mathematically, and there exist exactly 7 Frieze groups, 17 wallpaper groups, and 230 space groups (also including pure translation groups). Precise definitions of these groups can be found in (Coxeter, H. S. M., & Moser, W. O. J. (1980). Generators and Relations for Discrete Groups. Springer Berlin Heidelberg; Schattschneider, D. (1978). The Plane Symmetry Groups: Their Recognition and Notation. American Mathematical Monthly, 85, 439-450; Hahn, T. (2002). International Tables for Crystallography. Springer Netherlands). These references are incorporated by reference as if fully set forth.

It is customary to add a bias term to each feature map in a convolution layer. This can be done for G-convolution layers as well, but this operation must be adapted so as to use only one bias per G-feature map (not one bias per 2D feature map, if the feature map on G is represented as an array of dimension greater than 2). Likewise, batch normalization (Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the International Conference on Machine Learning (ICML), incorporated by reference as if fully set forth) must be generalized to use only a single bias and scale parameter per G-feature map.
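For concreteness, a short numpy sketch (illustrative only; the (K, 4, H, W) layout for a p4 feature map is one plausible convention): the single bias, and the single batch-normalization scale and shift, per G-feature map are broadcast over the group axis as well as the spatial axes:

    import numpy as np

    def add_g_bias(x, bias):
        # x: (K, 4, H, W) stack of p4 feature maps; bias: (K,).
        # One learned scalar per G-feature map, broadcast over the
        # rotation axis and the spatial axes, preserving equivariance.
        return x + bias[:, None, None, None]

    def g_batch_norm(x, gamma, beta, eps=1e-5):
        # x: (N, K, 4, H, W); gamma, beta: (K,).
        # Statistics are pooled over batch, group and spatial axes, so each
        # G-feature map has exactly one scale and one shift parameter.
        mean = x.mean(axis=(0, 2, 3, 4), keepdims=True)
        var = x.var(axis=(0, 2, 3, 4), keepdims=True)
        x_hat = (x - mean) / np.sqrt(var + eps)
        return (gamma[None, :, None, None, None] * x_hat
                + beta[None, :, None, None, None])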

Known CNNs typically involve layers of non-linearities, which can be used in G-CNNs without modification. While the layer does not change, we show below that it is compatible with G-convolution layers, in the sense that pointwise non-linearities are G-equivariant maps. The non-linearity $\nu : \mathbb{R} \to \mathbb{R}$ is applied to a feature map $f : G \to \mathbb{R}$ by composition to give the rectified feature map $C_\nu f = \nu \circ f$. Since $[C_\nu L_g f](x) = \nu(f(g^{-1}x)) = [L_g C_\nu f](x)$, one can see that

$C_\nu L_g = L_g C_\nu$

so the rectified feature map inherits the transformation properties of the previous layer (i.e. it is G-equivariant, and we may thus stack further G-convolution layers on top).

In order for instance to reduce the size of the feature maps, pooling can be applied in a known convolutional neural network; pooling as such is known in the art. In a G-CNN, we must use a generalized form of pooling which we call G-pooling. This can be used with any of the common pooling operators known in the art (as long as they are invariant to permutations of the pooling domain), including max-pooling, sum-pooling, and square-pooling.

In order to simplify the analysis, we split the pooling operation as it is usually conceived of into two steps: the pooling itself (performed without stride), and a subsampling step. The non-strided max-pooling operation applied to a feature map $f : G \to \mathbb{R}$ can be modeled as an operator P that acts on f as

$[Pf](g) = \max_{k \in gU} f(k)$

where $gU = \{gu \mid u \in U\}$ is the g-transformation of some pooling domain $U \subset G$ (typically a neighborhood of the identity transformation). In a regular convnet U is usually a 2 x 2 or 3 x 3 square including the origin (0, 0). This operation too commutes with transformations:

$[P L_u f](g) = \max_{k \in gU} f(u^{-1}k) = \max_{k \in u^{-1}gU} f(k) = [Pf](u^{-1}g) = [L_u Pf](g)$

Since pooling tends to reduce the variation in a feature map, it makes sense to sub-sample the pooled feature map, or equivalently, to do a "pooling with stride". In a G-CNN, the notion of "stride" is generalized by subsampling on a subgroup $H \subset G$. That is, H is a subset of G that is itself a group (i.e. closed under multiplication and inversion). The resulting subsampled feature map is then equivariant to H but not equivariant to G.

In a standard convnet, pooling with stride 2 is the same as pooling and then subsampling on $H = \{(2i, 2j) \mid (i, j) \in \mathbb{Z}^2\}$, which is a subgroup of $G = \mathbb{Z}^2$. For the p4-CNN (p4 being the group of translations and 90 degree rotations), we may for instance subsample on the subgroup H containing all 4 rotations and shifts by multiples of $2^n$ pixels, after n pooling layers. We can obtain full G-equivariance by choosing our pooling region U to be a subgroup $H \subset G$, so that we are pooling over cosets $gH = \{gh \mid h \in H\}$. For example, in a p4-CNN, we can pool over all four rotations at each spatial position (the cosets of the subgroup R of rotations around the origin). Or, in a $\mathbb{Z}^2$-CNN, we can pool over the cosets of the subgroup of even horizontal shifts. The resulting coset-pooled feature map is invariant to the right-action of H, because the pooling domains (cosets) are similarly invariant ($ghH = gH$), so we can arbitrarily choose a set of coset representatives to subsample on. The resulting feature map may be thought of as a function on the quotient space G/H, which carries a natural G-action: $g(g'H) = (gg')H$. For p4 and R, we have $p4/R = \mathbb{Z}^2$, i.e. the feature map resulting from pooling over cosets gR has the same transformation law as the input image.
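As an illustration of coset pooling in a p4-CNN (again with the hypothetical (K, 4, H, W) layout), pooling over the cosets gR of the rotation subgroup R is simply a maximum over the rotation axis, and subsampling on a translation subgroup is a strided slice:

    import numpy as np

    def p4_pool_rotation_cosets(x):
        # x: (K, 4, H, W) p4 feature map. Max over the cosets gR of the
        # rotation subgroup R. The result lives on p4/R = Z^2 and
        # transforms like an ordinary planar feature map.
        return x.max(axis=1)

    def subsample(x, stride=2):
        # Subsampling on the subgroup of translations by multiples of stride.
        return x[..., ::stride, ::stride]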

This concludes our definition of the various layers in a G-CNN (G-convolution or G-correlation, non-linearities and G-pooling, G-batch normalization), and the analysis of their equivariance properties.

To train a group equivariant convolutional network using a gradient-based optimization algorithm, the gradient of a loss function with respect to the parameters of the filters must be computed. As usual, this is done using the backpropagation algorithm. The gradients are easily derived for any G-convolution or G-correlation layer using standard calculus. Here we describe the required computations for the G-correlation layer.

Let feature map k at layer $l$ be denoted $f^l_k$, and let $f^{l-1}_j$ be feature map j at the previous layer, so that the forward pass computes $f^l_k = \sum_j f^{l-1}_j \star \psi^l_{kj}$, where $\psi^l_{kj}$ is the set of filter components applied to input feature map j at layer $l$. At some point in the backprop algorithm, we will have computed the derivative $\partial L / \partial f^l_k$ of the loss L for all k, and we need to compute $\partial L / \partial f^{l-1}_j$ to backpropagate to lower layers, as well as $\partial L / \partial \psi^l_{kj}$ (to update the parameters). We find that

$\frac{\partial L}{\partial f^{l-1}_j} = \sum_k \frac{\partial L}{\partial f^l_k} \star \psi^{l\,*}_{kj}$

where the superscript $*$ denotes the involution, $\psi^*(g) = \psi(g^{-1})$. For the gradient with respect to the filters, we obtain:

$\frac{\partial L}{\partial \psi^l_{kj}}(u) = \sum_{g \in G} \frac{\partial L}{\partial f^l_k}(g)\, f^{l-1}_j(g u)$

So we see that both the forward and backward passes involve convolution or correlation operations, as is the case in the current generation of convnets. Thus, the backward pass may be implemented using the same convolution and correlation routines that can be used in the forward pass.

In an embodiment, said group of symmetry transformations is a symmorphic group, wherein each symmetry transformation in said symmorphic group is the composition of one symmetry transformation from the stabilizer of the origin and one translation. In this, in an embodiment the stabilizer of the origin is the subgroup of said group of symmetry transformations that contains all symmetry transformations that leave invariant the origin of said discrete set of sampling points. In particular, the origin is a fixed but arbitrarily designated point from said discrete set of sampling points. In particular, the non-translational group convolution layer is implemented by first applying every transformation from the stabilizer of the origin to each filter in order to obtain a set of stabilizer transformed filters, and then applying a known translational convolution algorithm to said set of stabilizer transformed filters and said input stack of feature maps.

In an embodiment, the non-translational group convolution layer is implemented using the Fast Group Fourier Transform (FGFT) and its inverse, the Inverse Fast Group Fourier Transform (IFGFT), further comprising:

- applying the FGFT for said non-translational group of symmetry transformations to both said input stack of feature maps and said set of learnable filters, thus obtaining for said input stack of feature maps and each said learnable filter a matrix-valued function on the unitary dual of said non-translational group of symmetry transformations, and

- computing the matrix product of said Fourier transform of said input stack of feature maps with each of said Fourier transforms of said set of learnable filters, thus obtaining the Fourier transform of the group convolution of said input stack of feature maps with said learnable filters, and finally computing the Inverse Fast Group Fourier Transform (IFGFT) for said non-translational group of symmetry transformations.

In an embodiment, a learned bias term is added to one or more of said output stack of feature maps. In particular only a single bias term per feature map is added in the output stack of feature maps.
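To make the Fourier route concrete, the following sketch (an illustration only) uses the cyclic group Z/nZ, for which the group Fourier transform reduces to the ordinary DFT and the irreducible representations are one-dimensional, so the matrix product of the general case degenerates to a pointwise product of spectra:

    import numpy as np

    n = 12
    rng = np.random.default_rng(1)
    f = rng.standard_normal(n)    # feature map on the cyclic group Z/nZ
    psi = rng.standard_normal(n)  # filter on the same group

    # Group convolution [f * psi](g) = sum_h f(h) psi(h^{-1} g);
    # on Z/nZ this is circular convolution.
    direct = np.array([sum(f[h] * psi[(g - h) % n] for h in range(n))
                       for g in range(n)])

    # Fourier route: transform, multiply the spectra (a matrix product for
    # a nonabelian group, a scalar product here), inverse transform.
    via_fft = np.fft.ifft(np.fft.fft(f) * np.fft.fft(psi)).real
    assert np.allclose(direct, via_fft)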

In an embodiment, the discrete set of sampling points is the set of integers Z. In particular, the group of symmetry transformations is the group which consists of all compositions of integer translations and reflections about any point in the discrete set of sampling points, in particular the group consisting of all compositions of integer addition and multiplication by -1.

In an embodiment, the discrete set of sampling points is the set of sampling points in a square lattice. In particular, the group of symmetry transformations is the group p4 which consists of all compositions of integer translations in two orthogonal directions and rotations by multiples of 90 degrees about any point in the discrete set of sampling points.

In an embodiment, the discrete set of sampling points is the set of points in a square lattice. In particular the group of symmetry transformations is the group p4m which consists of all compositions of integer translations in two orthogonal directions, rotations by multiples of 90 degrees about any point in the discrete set of sampling points, and reflections in the mirror lines of said square lattice.

In an embodiment, the discrete set of sampling points is the set of points in a hexagonal lattice. In particular, the symmetry group is the group p6 which consists of compositions of integer translation in the two independent directions of said hexagonal lattice and rotations by multiples of 60 degrees about any point in said hexagonal lattice.

In an embodiment, the discrete set of sampling points is the set of points in a hexagonal lattice. In particular, the group of symmetry transformations is the group p6m which consists of compositions of integer translations in the two independent directions of said hexagonal lattice and rotations by multiples of 60 degrees about any point in said hexagonal lattice and reflections in mirrors that are aligned with the translation axes of said hexagonal lattice.

In an embodiment, the discrete set of sampling points is the set of points in a rhombic lattice, or a square lattice, or a hexagonal lattice, or a rectangular lattice, or a parallelogrammic lattice. In particular, the group of symmetry transformations is, respectively, the group cmm, p4m, p6m, pmm, or p2, or any of the subgroups of these groups that are not pure translation groups, in particular each of the 17 wallpaper groups except the group p1 of pure translations.

In an embodiment, the discrete set of sampling points is the set of points in a three-dimensional cubic lattice. In particular, the group of symmetry transformations is the group which consists of all compositions of integer translations in the three orthogonal directions of said lattice and rotations by multiples of 90 degrees about any axis attached to any point in said lattice.

In an embodiment, the discrete set of sampling points is the set of points in a three-dimensional cubic lattice. In particular, the group of symmetry transformations is the group which consists of all compositions of integer translations in the three orthogonal directions of said lattice, rotations by multiples of 90 degrees about any axis attached to any point in said lattice, and reflections in any axis-aligned mirror.

In an embodiment, the discrete set of sampling points is a three-dimensional orthorhombic or tetragonal lattice. In particular, the group of symmetry transformations is the group which consists of all symmetries of said lattice, or any of the subgroups of these groups that are not pure translation groups.

In an embodiment, the discrete set of sampling points is a three-dimensional Bravais lattice. In particular, the group of symmetry transformations is the group which consists of all symmetries of said lattice, or any of the subgroups of these groups that are not pure translation groups.

In an embodiment, the discrete set of sampling points consists of the elements in any of the 17 wallpaper groups except the pure translation group, or any of the 230 space groups except the pure translation groups, or any of the two-dimensional point groups or any of the three-dimensional point groups.

In an embodiment, the group of symmetry transformations is one of the 6 Frieze groups p11g, p1m1, p2, p2mg, p11m, p2mm, or any of their subgroups that are not pure translation groups.

In an embodiment, the discrete set of sampling points consists of the elements of a square, rectangular, rhombic, parallelogrammic or hexagonal lattice. In particular, the input stack of feature maps comprises pixels of a digital image having one or more color channels.

In an embodiment, G-pooling is applied to at least one of the feature maps from said output stack of feature maps. In particular said pooling is selected from any one of max-pooling, sum pooling, square pooling and a combination thereof.

In an embodiment, the data point is a spatio-temporal signal. In particular said spatio-temporal signal is selected from any one of audio spectrograms, video, and fMRI sequences. More in particular, said discrete set of sampling points comprises points at different points in time which are included in the group convolution.

The invention further pertains to an image classification device comprising an embodiment of the device described above for classification of digital images, wherein said symmetry group is p4m, and wherein said input stack of feature maps of a first non-translational group convolution layer is a digital image, and subsequent non-translational group convolution layers have p4m symmetry, and between each non-translational group convolution layer is a layer of non-linearities.

The invention further pertains to a classification method comprising applying an embodiment of the device to a data item to obtain a classification, wherein said data item is an input stack of feature maps.

The invention further pertains to a data segmentation method comprising applying an embodiment of the device, wherein the output of said convolutional neural network is obtained by G-pooling over said stabilizer of the origin, thus obtaining a segmentation map that transforms in the same way as the input stack of feature maps to the first layer of the network under transformations from said group of symmetry transformations.

The invention further pertains to a regression method comprising applying an embodiment of the device to a data item to obtain a numerical estimate of a quantity associated with the data item.

The invention further pertains to a data analysis method comprising an embodiment of the device, wherein the output of the convolutional neural network is a feature vector that is used for further computational analysis of input of the convolutional neural network.

The person skilled in the art will understand the term "substantially" when used in this application, such as in "substantially encloses" or in "substantially extends up to". The term "substantially" may also include embodiments with "entirely", "completely", "all", etc. Hence, in embodiments the adjective substantially may also be removed. Where applicable, the term "substantially" may also relate to 90% or higher, such as 95% or higher, especially 99% or higher, even more especially 99.5% or higher, including 100%. The term "comprise" includes also embodiments wherein the term "comprises" means "consists of".

Furthermore, the terms first, second, third and the like if used in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein.

The device and method herein are amongst others described during operation. As will be clear to the person skilled in the art, the invention is not limited to methods of operation or devices in operation.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "to comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device or apparatus claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

The invention further applies to a device or method or parts thereof comprising one or more of the characterising features described in the description and/or shown in the attached drawings. The invention further pertains to a method or process comprising one or more of the characterising features described in the description and/or shown in the attached drawings. The various aspects discussed in this patent can be combined in order to provide additional advantages. Furthermore, some of the features can form the basis for one or more divisional applications.

Brief description of the drawings

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying schematic drawings in which corresponding reference symbols indicate corresponding parts, showing in:

Figure 1a: a scalar valued function on the group p4, which can be interpreted as a p4-feature map or as a p4-filter;

Figure 1b: the same function after rotation by r, where r denotes the 90 degree rotation. The 2D patches associated with a rotation (e, r, r², r³) will follow the arrows while undergoing the rotation. The transformation law visualized here follows directly from the p4 group operation and the definition of $L_g$;

Figure 2a: a scalar valued function on the group p4m, which can be interpreted as a p4m-feature map or as a p4m-filter;

Figure 2b: the same function after rotation by r, where r denotes the 90 degree rotation. The 2D patches associated with a rotation (e, r, r², r³) will follow the arrows while undergoing the rotation;

Figure 2c: the same function as shown on the left after application of the mirroring operation m. The 2D patches follow the lines without arrowheads while undergoing a flip. The transformation law visualized here follows directly from the p4m group operation and the definition of L_g; and

Figure 3: a flow chart describing the G-convolution or G-correlation as computed by the G-FFT.

The drawings are not necessarily to scale.

Description of preferred embodiments

The G-convolution and G-correlation can be implemented straightforwardly using a loop or as a parallel GPU kernel. For each cell in the stack of output feature maps (which has indices for the filter dimension and one or more group dimensions, such as rotation, x-translation and y-translation), one has to compute an inner product. The inner product (a sum over products of corresponding elements, as defined above) is computed between the stack of feature maps that is the input to the layer and the transformed filter. Let g be the transformation that corresponds to a given cell in the output feature maps. Then g has to be applied to each un-transformed filter (which together form the set of parameters of the layer). This is done by, for each cell in the array that is to store the transformed filter (this cell being associated with the point h in the discrete sampling set on which the feature map is defined), doing a lookup in the un-transformed filter at the cell associated with the point g⁻¹h. Having done this for each cell in the array that stores the transformed filter, we have computed the transformed filter. The inner product between the g-transformed filter and the input feature map is then computed, and the result is stored in the cell of the output feature map associated with g and the filter that was used.
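By way of illustration only, the following is a minimal sketch of this loop in Python/NumPy for a first-layer p4-correlation (input on Z²). The function name, array layout and rotation convention are assumptions of the example, not part of the invention; a production implementation would typically be a parallel GPU kernel.

```python
import numpy as np

def p4_correlation_z2(feature_maps, filters):
    """First-layer p4-correlation on a Z^2 input, as a direct loop.

    feature_maps: (K_in, H, W) input stack, a function on Z^2.
    filters:      (K_out, K_in, h, w) un-transformed square filters.
    Returns:      (K_out, 4, H-h+1, W-w+1) output stack, a function on p4
                  (one planar map per 90-degree rotation).
    """
    K_in, H, W = feature_maps.shape
    K_out, _, h, w = filters.shape
    out = np.empty((K_out, 4, H - h + 1, W - w + 1))
    for s in range(4):  # the four rotations e, r, r^2, r^3 about the origin
        # Transform the filters: each cell of the transformed filter is a
        # lookup in the un-transformed filter at the point g^-1 h, which for
        # a 90-degree rotation amounts to an array rotation.
        rotated = np.rot90(filters, k=s, axes=(2, 3))
        for i in range(K_out):
            for y in range(H - h + 1):
                for x in range(W - w + 1):
                    # Inner product between the transformed filter and the
                    # input patch, summed over input channels.
                    patch = feature_maps[:, y:y + h, x:x + w]
                    out[i, s, y, x] = np.sum(rotated[i] * patch)
    return out
```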

For groups G that are symmorphic (meaning that every transformation in G can be written as the composition of an element of the stabilizer subgroup of the origin and a translation), we can compute the G-correlation using the Z²-correlation as a subroutine. This has the practical advantage that very efficient implementations of the Z²-correlation exist, so that less time has to be spent optimizing the code for the G-correlation.

This works as follows. The G-correlation can be split into a very cheap filter-indexing operation followed by a standard planar convolution.

For example, consider the group p4. A transformation h in p4 can be decomposed into a translation t and a rotation r about the origin as h = tr. From this, and using the homomorphism property $L_{tr} = L_t L_r$, the G-correlation can be decomposed into two parts:

$$[f \star \psi](tr) = \sum_{y \in X} \sum_{k} f_k(y)\, \bigl[L_t [L_r \psi_k]\bigr](y)$$

where X is the discrete sampling set on which the feature maps are defined.

The first part of the computation is to evaluate the rotated filters L_r ψ for each of the four rotations r about the origin. If ψ is a function on Z², it is clear what is meant by "rotating the filter". For a filter ψ that is a function on the group p4, we can index the group by three integers (θ, u, v). Rotating such a function amounts to rotating each 2D plane associated with a value of θ, and cyclically shifting the planes along the θ axis. In general, for other groups G, this transformation law follows directly from the definition of the group operation and the definition of the operator L_g.
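By way of illustration only, a minimal sketch of this filter transformation for p4, under an assumed (θ, u, v) array layout; the direction of the rotation and of the cyclic shift is a convention of the example:

```python
import numpy as np

def rotate_p4_filter(psi):
    """Apply L_r (rotation by 90 degrees) to a p4-filter.

    psi: array of shape (4, h, w), indexed by (theta, u, v).
    Per the text: rotate each 2D plane, then cyclically shift the
    planes along the theta axis.
    """
    return np.roll(np.rot90(psi, k=1, axes=(1, 2)), shift=1, axis=0)
```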

The second part of the computation is to evaluate, for all translations t, the sum over y and k using an optimized implementation of the 2D convolution. To see how this can be done, note that the sum over y ∈ X amounts to a sum over the θ, u and v axes. The θ axis of both the filter tensor and the feature maps can be folded into the feature axis (k) using a reshape operation. These tensors can then be passed to the convolution routines as if they represented 4·K^l 2D feature maps (where K^l denotes the number of feature maps at layer l; the factor 4 arises for p4, which has a stabilizer containing 4 elements: the rotations around the origin). The resulting 4·K^{l+1} planar feature maps can be reshaped into K^{l+1} feature maps on p4, each of which has a first axis of length 4. This approach naturally generalizes to any symmorphic group.
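By way of illustration only, the following sketch expresses this two-part computation using PyTorch's planar convolution routine as the optimized Z²-correlation. The function name, tensor layout and rotation conventions are assumptions of the example, not part of the invention:

```python
import torch
import torch.nn.functional as F

def p4_correlation_via_conv2d(f, psi):
    """p4-correlation computed with a single planar convolution call.

    f:   (N, K_in, 4, H, W)     input feature maps on p4.
    psi: (K_out, K_in, 4, h, w) un-transformed filters on p4.
    Returns (N, K_out, 4, H', W') output feature maps on p4.
    """
    N, K_in, _, H, W = f.shape
    K_out, _, _, h, w = psi.shape
    # Part 1 (filter transformation): stack the four transformed copies of
    # each filter -- rotate the 2D planes and cyclically shift the theta axis.
    rotated = torch.stack(
        [torch.roll(torch.rot90(psi, k=s, dims=(3, 4)), shifts=s, dims=2)
         for s in range(4)],
        dim=1)                                   # (K_out, 4, K_in, 4, h, w)
    # Part 2 (planar convolution): fold the theta axes into the channel axes
    # via a reshape and call the optimized 2D routine once.
    f2d = f.reshape(N, K_in * 4, H, W)
    w2d = rotated.reshape(K_out * 4, K_in * 4, h, w)
    out = F.conv2d(f2d, w2d)                     # (N, 4*K_out, H', W')
    return out.reshape(N, K_out, 4, out.shape[-2], out.shape[-1])
```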

A third method for computing the G-convolution or G-correlation, which is asymptotically fastest, is to use a Group Fast Fourier Transform (G-FFT) algorithm and its inverse (G-IFFT), which are sometimes referred to in the literature as Generalized FFTs. It is well known in the literature that a Fourier transform can be defined not just for functions on Z², but for functions on any locally compact topological group (which includes the discrete groups we make use of in this work). The Fourier transform of a function on G is an operator-valued function (in practice realized as a matrix-valued function) on what is known as the "unitary dual" of G (which is a discrete space whenever G is discrete). Furthermore, the well known convolution theorem, which says that the Fourier transform of the convolution of two functions is the product of their Fourier transforms, generalizes to the case of group Fourier transforms.

In the engineering literature, fast algorithms have been developed for computing the G-FFT and G-IFFT (Chirikjian, G. S., & Kyatkin, A. B. (2001). Engineering Applications of Noncommutative Harmonic Analysis (1st ed.). CRC Press; Rockmore, D. (1995). Some applications of generalized FFTs. In L. Finkelstein & W. Kantor (Eds.), Proceedings of the 1995 DIMACS Workshop on Groups and Computation (Vol. 28, pp. 329-370); Kyatkin, A. B., & Chirikjian, G. S. (1999). Pattern Matching as a Correlation on the Discrete Motion Group. Computer Vision and Image Understanding, 74(1), 22-35; Maslen, D., & Rockmore, D. (1995). Generalized FFTs: a survey of some recent results. In L. Finkelstein & W. Kantor (Eds.), Proceedings of the DIMACS Workshop on Groups and Computation (pp. 183-237); Rockmore, D. N. (2004). Recent Progress and Applications in Group FFTs. NATO Science Series II: Mathematics, Physics and Chemistry, 136, 227-254). These references are incorporated by reference as if fully set forth.

Since we always work with discrete functions that have bounded support, we can use the FFT for a finite group, even though these discrete functions are modeled as functions on an infinite discrete space or group. This can be achieved by appropriately zero-padding the function in the spatial dimensions, as is commonly done with ordinary FFTs as well. The computation of the G-convolution then proceeds as follows: first, the G-FFTs of the input feature maps and filters are computed. The resulting spectral functions are multiplied (and since these functions are matrix-valued, multiplication means matrix multiplication) and summed over the input feature maps. As is known in the literature, the G-correlation would require taking the conjugate transpose of the G-FFT of the filter, relative to the G-convolution. Finally, we apply the inverse G-FFT to obtain the desired result. This is illustrated in figure 3.
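By way of illustration only, the convolution theorem can be checked in the simplest group setting, the cyclic group C8, for which the G-FFT reduces to the ordinary DFT and the matrix-valued spectra are 1×1:

```python
import numpy as np

# Direct group convolution on C_8: [f * psi](g) = sum_h f(h) psi(h^-1 g).
f   = np.random.randn(8)
psi = np.random.randn(8)
direct = np.array([sum(f[h] * psi[(g - h) % 8] for h in range(8))
                   for g in range(8)])

# Spectral route: transform, multiply pointwise, transform back.
spectral = np.fft.ifft(np.fft.fft(f) * np.fft.fft(psi)).real

assert np.allclose(direct, spectral)
```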

The benefit of using the G-FFT for computing the G-convolution is that it is asymptotically faster; the other methods have merit because they are easier to implement and require less memory.

The gradient of a loss function with respect to the parameters (the filters) is easily computed for each of the G-correlation algorithms. In the first case, the gradient is obtained by a simple application of the rules of calculus, as derived previously. For the G-correlation implementation for symmorphic groups, which uses existing planar convolution routines as a subroutine, the gradient of the planar convolution is known. The gradient for the filter transformation is simply the gradient of an array indexing operation, which is also known. The gradient computation for the G-FFT implementation of the G-correlation is implemented by applying the inverses of the G-IFFT (which is the G-FFT) and the G-FFT (which is the G-IFFT) in reverse order to the output gradient. This will give the input gradient, because the G-FFT and its inverse are linear and unitary (hence the adjoint, i.e. conjugate transpose, of the Jacobian of the G-FFT is the G-IFFT, and vice versa for the G-IFFT).
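By way of illustration only, the unitarity argument can be verified numerically for the orthonormal DFT, whose adjoint is indeed its inverse:

```python
import numpy as np

# For the orthonormal DFT, the adjoint (conjugate transpose) of the
# transform is its inverse, so <F x, y> == <x, F^-1 y>.
x = np.random.randn(8) + 1j * np.random.randn(8)
y = np.random.randn(8) + 1j * np.random.randn(8)
lhs = np.vdot(np.fft.fft(x, norm="ortho"), y)
rhs = np.vdot(x, np.fft.ifft(y, norm="ortho"))
assert np.isclose(lhs, rhs)
```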

Experimental

The rotated MNIST dataset (Larochelle, Hugo, Erhan, D., Courville, Aaron, Bergstra, James, and Bengio, Yoshua. An empirical evaluation of deep architectures on problems with many factors of variation. Proceedings of the 24th International Conference on Machine Learning (ICML'07), 2007) contains 62000 randomly rotated handwritten digits. The dataset is split into training, validation and test sets of size 10000, 2000 and 50000, respectively.

Because data augmentation could potentially reduce the benefits of using G- convolutions, all experiments reported in this section use random rotations on each instance presentation. We found that at least for the rotated MNIST dataset this does not make much difference.

We performed model selection using the validation set, yielding an architecture (Z2CNN, see table 1) that outperforms the models tested by Larochelle et al. (2007) (when trained on 12k and evaluated on 50k), but does not match the previous state of the art, which uses prior knowledge about rotations (Schmidt, Uwe and Roth, Stefan. Learning rotation-aware features: From invariant priors to equivariant descriptors. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2012). The results are listed in the table below.

We then took this architecture and replaced each convolution by a p4-convolution followed by a max-pooling over all 4 rotation angles. The post-rotation-pooling feature maps transform in the same way as the input image (i.e. as a function on Z² as opposed to a function on p4), which means that all the convolutions used in this network have to be first-layer convolutions. The resulting architecture, P4CNNRotPool, has exactly the same number of parameters as the Z2CNN, but performs better: it roughly matches the previous state of the art. Finally, we designed a network, P4CNN (table below), that uses full p4-convolutions. This network clearly outperforms the previous state of the art.
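By way of illustration only, the rotation-pooling step amounts to a maximum over the rotation axis of the p4 feature maps; the array layout below follows the earlier sketches and is an assumption of the example:

```python
import numpy as np

# p4 feature maps, e.g. as produced by the p4_correlation_z2 sketch above:
out = np.random.randn(8, 4, 24, 24)   # (K_out, 4 rotations, H', W')
# Max-pooling over all 4 rotation angles yields maps that transform like
# the input image, i.e. as a function on Z^2 rather than on p4.
pooled = out.max(axis=1)              # (K_out, H', W')
```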

The work leading to this invention has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement n° 632913, and from the Innovative Public-Private Partnership in ICT (IPPSI) - KIEM, NWO Physical Sciences, under grant agreement n° IPPSI.KIEM.14.020.

It will also be clear that the above description and drawings are included to illustrate some embodiments of the invention, and not to limit the scope of protection. Starting from this disclosure, many more embodiments will be evident to a skilled person. These embodiments are within the scope of protection and the essence of this invention and are obvious combinations of prior art techniques and the disclosure of this patent.