CLAIMS

1. A computer-implemented method for encoding projection properties associated with image data (14), said method comprising the steps of:
a) determining a principal axis (A) of a projection model for obtaining the image data (14) from a scene; and
b) determining, for each point in the image data (14), a deflection metric indicative of an angle (α) between the principal axis (A) and a projection ray (B) through said point;
wherein the method is characterized by
c) encoding the deflection metric for each point in the image data (14) as a data value in projection data (18).

2. The method of claim 1, wherein the angle (α) and/or the deflection metric is mathematically equivalent or proportional to the zenith angle (α) of a spherical coordinate system, wherein the zenith is aligned with the principal axis (A) of the projection model.

3. The method of claim 1 or 2, wherein the projection model is based on a cylindrical or spherical projection and in particular based on a pinhole camera model.

4. The method of any one of the preceding claims, wherein the principal axis (A) goes through a center of a field of view of an imaging apparatus for recording the scene.

5. The method of any one of the preceding claims, wherein equipotential lines (eα) of the deflection metric encoded for each point in the image data (14) approximate elliptic arcs around the principal axis (A) in an image of the scene.

6. The method of any one of the preceding claims, wherein the method further comprises determining a local gradient (∇α) of the deflection metric at a certain point in the image data (14) based on the values of the deflection metric in neighboring points in the image data (14) for reconstructing a position of the certain point in the scene.

7. The method of any one of the preceding claims, wherein the local gradient (∇α) of the deflection metric at a given point is substantially aligned along a line through the principal point associated with the projection model, wherein a structure of the projection data (18) reflects a structure of the image data (14), such that a local operation on the deflection metric of neighboring points estimates the local gradient (∇α), in particular using an image gradient operator on the deflection metrics of the given point and its direct neighbors in the image data (14), wherein the image gradient operator is preferably a discrete differentiation operator for computing an approximation of the gradient (∇α) of the deflection metric in the points of the image data (14), most preferably a Prewitt operator, a Sobel operator, a Scharr operator, or a Kayyali operator.

8. The method of any one of the preceding claims, wherein the method further comprises recording a distance (R, D) between the camera and the projected object for each point in the image data (14).

9. The method of any one of the preceding claims, wherein the method further comprises providing the image data (14) alongside the deflection metric to a machine learning classifier (24) for classifying objects in an image of the scene based on the image data (14).

10. The method of any one of the preceding claims, wherein the method comprises:
a) receiving image data (14) for the scene and projection information (16) of an imaging system for recording the image data (14), in particular comprising a focal length, a pixel magnification, a skew, a principal point shift, a sensor dimension, an angular resolution, or a parametrization of lens distortions, of the imaging system, or a combination thereof; and
b) determining the deflection metric (α, d(u)) for each point in the image data (14) based on the projection information (16) of the imaging system.

11. The method of any one of the preceding claims, wherein the method comprises:
a) receiving three-dimensional point data of the scene; and
b) calculating a projection of the three-dimensional point data on a two-dimensional image for obtaining two-dimensional image data (14) for the scene.

12. A non-transitory medium comprising machine-readable instructions, which, when executed by a processing system (12), implement a method according to any one of the preceding claims.

13. An image data (14) encoding system (10) comprising a processing system (12), wherein the processing system (12) is configured to:
a) receive image data (14) of a scene;
b) determine a principal axis (A) of a projection model for projecting the scene onto a two-dimensional grid of pixels; and
c) determine, for each pixel projected from the scene, a deflection metric (α, d(u)) indicative of an angle (α) between the principal axis (A) and a projection ray (B) through said pixel;
wherein the system (10) is characterized in that the processing system (12) is further configured to
d) encode the deflection metric as a data value associated with each pixel as projection data (18).

14. The system (10) of claim 13, wherein the angle (α) and/or the deflection metric (α, d(u)) is mathematically equivalent or proportional to the zenith angle (α) of a spherical coordinate system, wherein the zenith is aligned with the principal axis (A) of the projection model.

15. The system (10) of claim 13 or 14, wherein the principal axis (A) goes through a center of a field of view of an imaging system for recording the scene.

16. The system (10) of any one of claims 13 to 15, further comprising an imaging system for obtaining the image data (14) of the scene via a measurement, in particular comprising a camera and/or a distance measuring device, wherein the processing system (12) is configured to receive the image data (14) from the imaging system.

17. The system (10) of any one of claims 13 to 16, wherein the processing system (12) is further configured to provide the image data (14) alongside the deflection metric (α, d(u)) to a machine learning classifier (24) for classifying objects in an image of the scene based on the image data (14).

18. The system (10) of any one of claims 13 to 17, wherein the processing system (12) is further configured to:
a) receive image data (14) for the scene and projection information (16) of an imaging system for recording the image data (14), in particular comprising a focal length, a pixel magnification, a skew, a principal point shift, a sensor dimension, an angular resolution, or a parametrization of lens distortions, of the imaging system, or a combination thereof; and
b) determine the deflection metric for each point in the image data (14) based on the projection information (16) of the imaging system.

19. The system (10) of any one of claims 13 to 18, wherein the processing system (12) is further configured to:
a) receive three-dimensional point data of the scene; and
b) calculate a projection of the three-dimensional point data on a two-dimensional image for obtaining two-dimensional image data (14) for the scene.

20. A data structure, comprising:
a) image data (14), wherein the image data (14) comprises a plurality of image values arranged in a regular array, the regular array of image values forming a two-dimensional image; and characterized by
b) projection data (18), wherein the projection data (18) comprises a plurality of deflection metric values arranged in a regular array reflecting the structure of the regular array of image values, and wherein the deflection metric values are each indicative of an angle (α) between a principal axis (A) of a projection model for obtaining the image data (14) from a scene and a projection ray (B) corresponding to the image value in the image data (14) at the same position as the deflection metric value.

21. An image classifying system (10) comprising a machine learning classifier (24), wherein the machine learning classifier (24) is trained with image data (14) based on a two-dimensional grid of pixels, wherein each pixel is associated with a deflection metric indicative of an angle (α) between a principal axis (A) of a projection model for projecting the scene onto the two-dimensional grid of pixels and a projection ray (B) through said pixel, and characterized in that the deflection metric is provided to the machine learning classifier (24) as a data value alongside the image data (14) as an input.
The deflection metric may be preconfigured in the imaging system, such as a camera, and may be encoded with the image data. However, the deflection metric may equally be calculated based on the projection information by a processing system internal or external to the imaging system.
In preferred embodiments, the method further comprises receiving three-dimensional point data of the scene, and calculating a projection of the three-dimensional point data on a two-dimensional image for obtaining two-dimensional image data for the scene.
In other words, the method may be equally applied to three-dimensional point data, e.g. as obtained by a radar or LiDAR system. Preferably, the projection results in the equivalent of an image taken by a camera, such that the image data is comparable between camera images and the aforementioned distance measuring systems. For example, the projection may project the three-dimensional point data of the scene onto a two-dimensional image, wherein the projection properties associated with the two-dimensional image are equivalent to a camera projection model.
For example, the Cartesian coordinates of the measurement points may be converted into spherical ones, e.g. through a transformation of a point p = (x, y, z)^T according to

r = √(x² + y² + z²), θ = arcsin(z/r), φ = atan2(y, x).

Subsequently, a projection model may be used to generate a spherical range image whose points can be described by image point vectors u:

u = (u_φ, u_θ)^T = (φ/Δφ + c_φ, θ/Δθ + c_θ)^T,

wherein, analogous to the projection model of pinhole cameras, the projection matrix can describe a discretization Δφ, Δθ along the angles φ, θ and a shift of the center coordinates c_φ, c_θ defined by the height and width of the resulting image. Since the discretization can cause several points to be projected onto one pixel, only the points with the smallest Euclidean distance r to the sensor may be used. For a conventional spinning LiDAR sensor, the image height h and width w may be equivalent to the number of layers and azimuth increments, respectively.
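The projection described above can be sketched in code. The following is a minimal illustration, assuming a spinning LiDAR whose vertical field of view is bounded by the hypothetical parameters fov_up_deg and fov_down_deg; all function and variable names are illustrative, not taken from the text.

```python
import numpy as np

def spherical_range_image(points, h, w, fov_up_deg, fov_down_deg):
    """Project an (N, 3) Cartesian point cloud onto an h x w spherical range image.

    Assumed layout: rows index the elevation (layers), columns the azimuth.
    Pixels not hit by any point remain at infinity.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)           # Euclidean distance to the sensor
    phi = np.arctan2(y, x)                     # azimuth in [-pi, pi]
    theta = np.arcsin(z / r)                   # elevation

    fov_up = np.deg2rad(fov_up_deg)
    fov_down = np.deg2rad(fov_down_deg)

    # Discretize the angles into pixel coordinates (shift, scale, clip)
    u = (0.5 * (1.0 - phi / np.pi) * w).astype(int).clip(0, w - 1)
    v = ((fov_up - theta) / (fov_up - fov_down) * h).astype(int).clip(0, h - 1)

    # Keep only the closest point per pixel: write the far points first,
    # so nearer points overwrite them.
    image = np.full((h, w), np.inf)
    order = np.argsort(-r)
    image[v[order], u[order]] = r[order]
    return image
```

The sort-then-overwrite step implements the rule from the text that only the point with the smallest Euclidean distance r survives when several points fall onto the same pixel.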
With the spherical projection, an image representation I can be constructed. Points from a 3D point cloud and auxiliary data can be projected to this ordered image representation, which may result in several images for a LiDAR scan. The LiDAR scan may provide image values associated with the points, such as I_r for the measured Euclidean distance and I_ref for the reflectivity measure of the LiDAR. From I_r and I_ref, an image representation I_color may be constructed. For example, for the construction of the color image, I_r may be used as the hue channel and I_ref as the value or brightness channel of an HSV colorspace. The construction of a color image can allow for human interpretability of the images. The HSV colorspace may be converted to RGB. The RGB color space is well established for machine learning architectures, such that by using said standard, existing pre-trained machine learning models may be used or adapted in conjunction with the method.

The method may be implemented on a processing system. The processing system may comprise a single processing unit or a plurality of processing units, which may be functionally connected. The processing units may comprise a microcontroller, an ASIC, a PLA, a CPLD, an FPGA, or another processing device, including processing devices operating based on software, hardware, firmware, or a combination thereof. The processing devices can include an integrated memory, or communicate with an external memory, or both, and may further comprise interfaces for connecting to sensors, devices, appliances, integrated logic circuits, other controllers, or the like, wherein the interfaces may be configured to receive or send signals, such as electrical signals, optical signals, wireless signals, acoustic signals, or the like.
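The construction of I_color from I_r and I_ref can be illustrated as follows. The normalization constant r_max, the use of full saturation, and the per-pixel conversion via Python's colorsys module are assumptions of this sketch, not details from the text.

```python
import colorsys
import numpy as np

def make_color_image(i_r, i_ref, r_max=100.0):
    """Combine a range image i_r and a reflectivity image i_ref into an RGB image.

    Following the text: range drives the hue channel, reflectivity the value
    (brightness) channel of an HSV image, which is then converted to RGB.
    """
    h, w = i_r.shape
    rgb = np.zeros((h, w, 3))
    for row in range(h):
        for col in range(w):
            hue = min(i_r[row, col] / r_max, 1.0)      # normalized range as hue
            val = min(max(i_ref[row, col], 0.0), 1.0)  # reflectivity as brightness
            rgb[row, col] = colorsys.hsv_to_rgb(hue, 1.0, val)
    return rgb
```

A vectorized conversion (or a library routine) would be preferable for real images; the explicit loop is kept here for clarity.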
For example, the processing system may be connected to an imaging system, such as a camera, and a digital storage through a data interface for receiving the image data and projection information for the imaging apparatus. In some embodiments, the processing system comprises or communicates with a graphics processing unit and/or a neural processing unit and/or a deep learning processing unit, such as for implementing the machine learning classifier or performing other numerical calculations as part of the method.

In some embodiments, a processing system receives the projection information comprising configuration information of the imaging system and retrieves an associated map of the deflection metric for the configuration information, which may have been previously generated, for determining the principal axis and determining the respective deflection metric values. In other words, in some embodiments, a pixel-wise mapping of the deflection metric is pre-generated and, in the method, the steps of determining the principal axis and determining the deflection metric are combined by retrieving the appropriate pixel-wise mapping of the deflection metric for the projection information. The pixel-wise mapping of the deflection metric may then be supplied alongside the image data to a machine learning classifier as an analysis set.

According to a second aspect, the invention relates to a non-transitory medium comprising machine-readable instructions, which, when executed by a processing system, implement a method according to any one of the preceding embodiments of the method according to the first aspect.

According to a third aspect, the invention relates to an image data encoding system comprising a processing system.
The processing system is configured to receive image data of a scene, to determine a principal axis of a projection model for projecting the scene onto a two-dimensional grid of pixels, to determine, for each pixel projected from the scene, a deflection metric indicative of an angle between the principal axis and a projection ray through said pixel, and to encode the deflection metric for each pixel as projection data.

In preferred embodiments, the angle and/or the deflection metric is mathematically equivalent or proportional to the zenith angle of a spherical coordinate system, wherein the zenith is aligned with the principal axis of the projection model. In preferred embodiments, the principal axis goes through a center of a field of view of an imaging system for recording the scene.

In preferred embodiments, the system further comprises an imaging system for obtaining the image data of the scene via a measurement, in particular comprising a camera and/or a distance measuring device, and the processing system is configured to receive the image data from the imaging system. The processing system may be configured to provide the projection data alongside the image data to an analysis system, such as a machine learning classifier. In preferred embodiments, the processing system is further configured to provide the image data alongside the deflection metric to a machine learning classifier for classifying objects in an image of the scene based on the image data.
In preferred embodiments, the processing system is further configured to receive image data for the scene and projection information of an imaging system for recording the image data, in particular comprising a focal length, a pixel magnification, a skew, a principal point shift, a sensor dimension, an angular resolution, or a parametrization of lens distortions, of the imaging system, or a combination thereof; and to determine the deflection metric for each point in the image data based on the projection information of the imaging system. In preferred embodiments, the processing system is further configured to receive three-dimensional point data of the scene; and to calculate a projection of the three-dimensional point data on a two-dimensional image for obtaining two-dimensional image data for the scene.

According to a fourth aspect, the invention relates to a data structure comprising image data and projection data. The image data comprises a plurality of image values arranged in a regular array, the regular array of image values forming a two-dimensional image. The projection data comprises a plurality of deflection metric values arranged in a regular array reflecting the structure of the regular array of image values. The deflection metric values are each indicative of an angle between a principal axis of a projection model for obtaining the image data from a scene and a projection ray corresponding to the image value in the image data at the same position as the deflection metric.
According to a fifth aspect, the invention relates to an image classifying system comprising a machine learning classifier, wherein the machine learning classifier is trained with image data based on a two-dimensional grid of pixels, wherein each pixel is associated with a deflection metric indicative of an angle between a principal axis of a projection model for projecting the scene onto the two-dimensional grid of pixels and a projection ray through said pixel, and wherein the deflection metric is provided to the machine learning classifier alongside the image data as an input.

DETAILED DESCRIPTION OF EMBODIMENTS

The features and numerous advantages of the method and system according to the present invention will best be understood from a detailed description of preferred embodiments with reference to the accompanying drawings, in which:

Fig. 1 schematically illustrates an example of a projection of a point in a three-dimensional scene onto an image plane according to a pinhole camera model;
Fig. 2A-C schematically illustrate examples of different representations of point locations in a spherical projection;
Fig. 3 illustrates an example of a spherical coordinate representation of a point in a view of an image plane using the projection of Fig. 2C;
Fig. 4 illustrates an example of a computer-implemented method for encoding projection properties associated with image data;
Fig. 5 illustrates a schematic example of an encoding system for implementing the method illustrated in the example of Fig. 4;
Fig. 6 illustrates a schematic example of an analysis data set, in which an image of a scene is overlaid with an illustration of the corresponding deflection metric;
Fig. 7 illustrates an example of a classification system; and
Fig. 8 illustrates an example of a classification system including a modified machine learning model.
Fig. 1 schematically illustrates an example of a projection of a point in a three-dimensional scene onto a virtual image plane I according to a pinhole camera model. In the illustration, a virtual camera position C (e.g. the camera aperture position) coincides with the origin of a Cartesian coordinate system defined by x-, y-, and z-axes. The pinhole camera model is defined by a principal axis A, which is represented by a dashed line through the camera position C and oriented along the z-axis of the coordinate system. A principal point P is defined as the intersection between the image plane I and the principal axis A.
The point is projected onto the projected point u in the image plane I along a projection ray B, the projected point u having pixel coordinates u, v. The projection ray B and the principal axis A define an angle α between each other. The distance between the camera position C and the point may be defined as the shortest distance R (with the camera position C located at the origin of the coordinate system), or may be defined as the projected distance D, i.e. the scalar product between the vector representation of the point and a normalized vector associated with the principal axis A, which in the illustrated example simplifies to "Z".
The 3D point (X, Y, Z)^T can be projected onto two-dimensional projected coordinates x by means of a transformation towards homogeneous coordinates:

x = (X/Z, Y/Z, 1)^T,

such that all points of a scene can be projected onto a projected plane with normalized distance from the origin of the coordinate system, i.e. the virtual camera position C. Pixel coordinates u, v of the projected point can be determined from the projected coordinates by applying the intrinsic matrix K associated with the camera/projection model, and vice versa:

(u, v, 1)^T = K x, x = K⁻¹ (u, v, 1)^T.
The intrinsic matrix K, sometimes also referred to as "camera matrix", provides a mapping between 3D coordinates of points in a scene (3D space) to 2D image coordinates, e.g. on a pixel camera sensor, and can encode the projection properties of an imaging system. The intrinsic matrix can be obtained via calibration, measurement of imaging parameters, modelling of the imaging system, calculation from camera/projection properties, or a combination thereof, e.g. in a calibration process which may also be called camera resectioning or (geometric) camera calibration. For image data projected from 3D point data, the intrinsic matrix may be directly calculated based on the projection properties.
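The mapping between 3D scene coordinates and pixel coordinates via the intrinsic matrix can be illustrated as follows; the focal lengths and principal point values used here are placeholder numbers, not parameters from the text.

```python
import numpy as np

# Illustrative intrinsic matrix K with focal lengths fx, fy and principal
# point (cx, cy); the concrete numbers are assumptions for demonstration.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

def project(point_3d, K):
    """Project a 3D point (X, Y, Z) to pixel coordinates (u, v)."""
    X, Y, Z = point_3d
    x = np.array([X / Z, Y / Z, 1.0])  # homogeneous coordinates on the plane Z = 1
    u, v, _ = K @ x
    return u, v

def backproject(u, v, K):
    """Map pixel coordinates back to the normalized projected plane via K^-1."""
    return np.linalg.inv(K) @ np.array([u, v, 1.0])
```

A point on the principal axis, e.g. (0, 0, Z), projects to the principal point (cx, cy), and back-projecting the principal point recovers the direction of the principal axis.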
Starting from a representation of the point in pixel (image) coordinates u, v, the distance d(u) in homogeneous coordinates of the projected plane may therefore be calculated using the inverse of the intrinsic matrix according to:

d(u) = ‖K⁻¹ (u, v, 1)^T‖.
The distance d(u) is a normed distance in homogeneous coordinates, i.e. the distance between the camera position C and the principal point P is one. Accordingly, the angle α between the principal axis A and the projection ray B can be calculated according to:

α = arccos(1/d(u)).
The angle α may describe an opening angle associated with a pixel of a camera in relation to the principal axis A. The angle α may correspond to a zenith angle (inclination) of a spherical coordinate representation of the point, wherein the zenith axis is aligned along the principal axis of an imaging system, such as a camera.
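The calculation of the angle α from the inverse intrinsic matrix, as outlined above, can be sketched as follows; the intrinsic parameters are again illustrative placeholders.

```python
import numpy as np

def zenith_angle(u, v, K_inv):
    """Zenith (deflection) angle of the projection ray through pixel (u, v).

    The back-projected vector lies on the plane at normalized distance 1 from
    the camera, so its z-component is 1 and cos(alpha) = 1 / ||K^-1 (u, v, 1)||.
    """
    ray = K_inv @ np.array([u, v, 1.0])
    return np.arccos(ray[2] / np.linalg.norm(ray))

# Placeholder intrinsic matrix (assumed values, as before)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
K_inv = np.linalg.inv(K)
# At the principal point the ray coincides with the principal axis, so alpha = 0.
```

For a pixel displaced from the principal point by exactly one focal length, the back-projected ray is (1, 0, 1), giving α = π/4, consistent with the arccos formula above.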
Fig. 2A illustrates an example of a spherical coordinate representation of a three-dimensional vector in the context of a Cartesian coordinate system with axes x, y, z. The illustrated spherical coordinate system is defined by a zenith axis, aligned with the z-axis of the Cartesian coordinate system, as well as a reference direction (sometimes called azimuth reference), perpendicular to the zenith axis and corresponding to the x-direction in the illustrated example. The vector may then be described by the zenith angle α between the vector and the zenith direction, the azimuth angle β between the projection of the vector onto the x-y-plane and the x-axis, and the radial distance r = √(x² + y² + z²), with x, y, and z being the respective values of the x-, y-, z-coordinates of the vector.
Conventionally, as shown in the example of Fig. 2B, when describing the origin of a point u of a two-dimensional projection of a scene on an image plane I with spherical coordinates, the principal axis of the projection is aligned with the reference direction (i.e. the x-axis), and the pixel coordinate system (indicated by arrows "u, v") is perpendicular to the reference direction and parallel to the zenith direction (i.e. parallel to the y-z plane). For example, the origin of a point in image coordinates may be described by an associated pair of a "pitch angle", e.g. θ = 90° − α, and a "yaw angle", e.g. φ = β, in spherical coordinates. The "pitch angle" and the "yaw angle" may provide measures of an associated vertical field of view (opening angle) and an associated horizontal field of view (opening angle), respectively, from a camera position C for that point. Lines of constant "pitch angle" and lines of constant "yaw angle" are straight lines in the image plane I, aligned along the y- (u-) and the z- (v-) direction, respectively.
Fig. 2C illustrates an example, where the principal axis A is instead aligned with the zenith direction ("z"), such that the pixel coordinate system (indicated by arrows "u, v") associated with the image plane I is parallel to the x-y-plane (not shown in Fig. 2C) and therefore parallel to the reference direction (parallel to "u"). In this illustrated case, the origin of each projected point u in the field of view may be described by the zenith angle α and the azimuth angle β.
As opposed to the previous example, lines of constant zenith angle α are not straight, but form curved equipotential lines "e" in the image plane I around the principal point P, which may be elliptic or circular depending on the projection properties. Lines of constant azimuth angle β are straight lines in the image plane I, but may fan out circumferentially from the principal point P, wherein each value of the azimuth angle β is associated with an equipotential line (not shown in Fig. 2C, but shown in Fig. 3) oriented differently in the image plane I.
As a result of the curvature of the equipotential lines "e" associated with a given value of the zenith angle α, the equipotential lines "e" may in general be oblique to a regular grid of a pixel matrix in which image data is usually captured and/or stored. Moreover, and contrary to the example in Fig. 2B, the local directions of the respective equipotential lines "e" in a regular (e.g. rectangular) pixel grid may in general be different at different points.
Notably, due to the definition of the coordinate system, at a given point the equipotential lines of the zenith angle α and the azimuth angle β may be substantially perpendicular to each other. Moreover, in a local approximation, the gradient of the zenith angle α and the gradient of the azimuth angle β may be substantially perpendicular to their respective equipotential lines. Hence, in principle, by determining the local gradient of one of the angles α, β, the local direction of the equipotential lines e of the respective other angle β, α may be determined.
Fig. 3 illustrates an example of a spherical coordinate representation of a point u in an image plane I using the projection of Fig. 2C, with the principal point P representing the intersection of the image plane I with the principal axis A. The point lies on an equipotential line eα associated with a given value of the zenith angle α, and is further associated with an azimuth angle β (associated with an equipotential line eβ). Vector arrows illustrate the local gradient of the zenith angle α and the azimuth angle β, respectively, at the point u.
As can be seen from the figure, the local gradient of the zenith angle α and the azimuth angle β will locally be aligned substantially along the equipotential lines eβ, eα of the azimuth angle β and the zenith angle α, respectively. Specifically, the local gradient of the zenith angle α is substantially oriented along the vector from the principal point P to the point u, i.e. it points radially outward from the principal point P, substantially aligned with an equipotential line eβ of the azimuth angle β.
As each equipotential line eβ of the azimuth angle β has a characteristic direction, by locally estimating the gradient of the zenith angle α, a local estimation of the azimuth angle β can be inferred.
The direction of the local gradient can be inferred from a convolutional operation on the values of the zenith angle α on a local portion of the pixel grid, e.g. by applying a Sobel operator on a matrix of the values of the zenith angle α at neighboring pixels. Specifically, with the Sobel filter the partial derivatives of the zenith angle α with respect to the grid axes u, v of the pixel grid may be determined by computing the convolutions (*)

G_u = [[-1, 0, +1], [-2, 0, +2], [-1, 0, +1]] * α,
G_v = [[-1, -2, -1], [0, 0, 0], [+1, +2, +1]] * α,

in order to determine an approximation of the local gradient ∇α ≈ (G_u, G_v)^T.
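A minimal sketch of this gradient estimation, using cross-correlation of the 3×3 neighborhood with Sobel kernels (border handling is omitted, and the sign convention of convolution versus correlation, which only flips the gradient sign for these kernels, is glossed over):

```python
import numpy as np

# Sobel kernels for the u- (horizontal) and v- (vertical) derivative
SOBEL_U = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_V = SOBEL_U.T

def local_gradient(alpha, row, col):
    """Approximate the gradient of a zenith-angle map at one interior pixel.

    Correlates the 3x3 neighborhood of (row, col) with the Sobel kernels.
    """
    patch = alpha[row - 1:row + 2, col - 1:col + 2]
    return np.sum(SOBEL_U * patch), np.sum(SOBEL_V * patch)
```

From the resulting components (G_u, G_v), the azimuth direction discussed in the text could then be estimated, e.g. as atan2(G_v, G_u).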
Since the local gradient ∇α is aligned substantially along the vector pointing at the point from the principal point P, and the zenith angle α provides a distance measure from the principal point P, the location of the point in the image plane I may therefore be estimated based on the knowledge of the values of the zenith angle α in a regular grid alone. Further, based on the direction of the vector, the azimuth angle β may also be calculated, e.g. to reconstruct a location of the point in the field of view of a camera in spherical coordinates.
In other words, although the zenith angle α in principle only represents one of the two angles α, β generally required to describe a point in spherical coordinates, local convolutional operations can be used to estimate the azimuth angle β when the zenith angle α is available for a regular grid of pixels, in principle as long as the zenith direction is oblique to the image plane I.
Hence, when the principal axis A is aligned along the zenith direction ("z"), the zenith angle α of a spherical coordinate system can be used to encode positional information in a pixel matrix used for recording image data with a single data channel. The zenith angle α inherently provides projection information as an angular reference to the principal axis A, indicating an opening angle associated with a respective pixel of a pixel matrix. The zenith angle α further provides a normalized distance measure, which can be used to generalize the processing of image data obtained between different imaging systems.
Fig. 4 illustrates an example of a computer-implemented method for encoding projection properties associated with image data. The method comprises determining a principal axis A of a projection model for obtaining image data from a scene (S10), and determining, for each point in the image data, a deflection metric indicative of an angle α between the principal axis A and a projection ray through said point (S12). The method then comprises encoding the deflection metric for each point in the image data as projection data (S14).
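The steps S10 to S14 can be sketched end to end as follows, assuming a pinhole model given by an intrinsic matrix K (the values supplied by the caller); the per-pixel zenith-angle map serves as the projection data that may be stacked with the image data as an additional channel.

```python
import numpy as np

def deflection_channel(height, width, K):
    """Per-pixel map of the zenith angle, shaped like the image (projection data).

    A sketch of the encoding step: the resulting array mirrors the pixel grid
    of the image data, so it can be stacked with the image as an extra channel.
    """
    K_inv = np.linalg.inv(K)
    alpha = np.zeros((height, width))
    for v in range(height):
        for u in range(width):
            ray = K_inv @ np.array([u, v, 1.0])  # back-projected ray, z-component 1
            alpha[v, u] = np.arccos(ray[2] / np.linalg.norm(ray))
    return alpha

# Stacking with an H x W x 3 image would yield an H x W x 4 analysis data set:
# analysis = np.dstack([image, deflection_channel(H, W, K)])
```

The map is zero at the principal point and grows radially outward, matching the concentric equipotential lines described for Fig. 6 below; a vectorized implementation would replace the double loop for real image sizes.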
The deflection metric may be selected as the zenith angle α or another distance measure indicative of the zenith angle α, such as a distance between the respective point and the principal point P in the image plane I, when the distance between the origin of the coordinate system (e.g. the virtual camera position C) and the projected plane is a normalized value. For example, the distance may be provided in homogeneous coordinates (i.e. with a normalized value equal to "one"). The deflection metric may be calculated with respect to the principal axis A, which advantageously corresponds to an optical axis of an imaging system for recording the image data. The deflection metric may then be recorded as an array of deflection metric values in projection data. The array may reflect the shape of an array of data values in image data, such that each point in image data can be associated with a corresponding value of the deflection metric in projection data. The projection data can be encoded as an additional data channel with the image data, or can be encoded in a separate format, e.g. in a separate file, for providing the projection data to a processing unit, such as a machine learning classifier.

Fig. 5 illustrates a schematic example of an encoding system 10 for implementing the method illustrated in the example of Fig. 4. The system 10 comprises a processing system 12 for receiving image data 14 and projection information 16. The processing system 12 may comprise an ASIC or a microcontroller, which may receive the image data 14 from an imaging system, e.g. a camera (not shown). The imaging system may be associated with the processing system 12 in a common mobile platform, or the image data 14 may be transferred from the imaging system to the processing system 12 over a communication network for processing of the image data 14 in a remote location.
The image data 14 may be accompanied by projection information 16, or respective projection information 16 may be recorded for the imaging system at the processing system 12 or at another digital storage location. The processing system 12 may determine the principal axis A associated with a projection model for obtaining the image data 14. For example, the processing system 12 may determine the principal axis A as a normal through a center of the field of view associated with the image data 14 for estimating the position and orientation of the optical axis of an associated camera. The processing system 12 may also select the principal axis A based on received information, e.g. provided with the image data 14 or the projection information 16. For example, the processing system 12 may determine the principal axis A based on the projection information 16, e.g. based on information on a misalignment of lenses or of an imaging sensor array with respect to the optical axis of the imaging system. As a specific example, the projection information 16 may comprise the intrinsic matrix K of the imaging system, and the processing system 12 may determine a location of the principal axis A based on the principal point P recorded in the intrinsic matrix K (e.g. based on coefficients k13, k23). However, the skilled person will appreciate that the intrinsic matrix K may also be calculated based on the projection information 16, e.g. based on camera system identifiers or projection properties, as discussed above. When the principal axis A has been determined, the deflection metric may be calculated as a metric indicative of an opening angle associated with each point with respect to the principal axis A. For example, for each point at a given location in the pixel matrix, e.g. a given pixel position, the processing system 12 may calculate the deflection metric for that point. The values of the deflection metric for each point may then be recorded in a pixel matrix as projection data 18.
The projection data 18 may then be provided with the image data 14 as an analysis data set 20, e.g. by encoding the deflection metric in a data channel of an image data file, or by providing the projection data 18 and the image data 14 as separate files to an analysis system (not shown).

Fig. 6 illustrates a schematic example of an analysis data set 20, in which an image of a scene is overlaid with the corresponding deflection metric for each point, wherein the deflection metric is the zenith angle α and is graphically illustrated by equipotential lines eα of the zenith angle α. The zenith angle α is calculated with respect to the principal axis A, which is perpendicular to the plane of projection and goes through the principal point P. Further, an exemplary point is illustrated in conjunction with an associated value of the local gradient ∇α of the zenith angle α. As schematically illustrated in Fig. 6, the zenith angle α may be arranged in the pixel matrix of the image data 14 according to equipotential lines eα forming concentric ellipses or circles, such that for each point in the image data 14 a corresponding measure of its distance to the principal point P can be provided. This distance is provided in a normalized form with respect to the angle between the projection ray B and the principal axis A, thereby providing a measure of the opening angle at that point. As can also be seen in Fig. 6, the additional data channel, in which the zenith angle α is encoded for each point in the image data 14 (e.g. pixel), may be used to reconstruct the local gradient ∇α of the zenith angle α. The local gradient ∇α can for example be calculated based on a convolutional Sobel operator as discussed in conjunction with Fig. 3. Hence, the zenith angle α provided in the pixel grid of the image data 14 also provides a measure of an azimuth angle β for locating the origin of the respective pixel in the projected plane. By further providing a distance measure (e.g.
“D” or “R”, as shown in Fig. 1) for each pixel, the 3D point cloud associated with the pixel data can be reconstructed. The distance measure may be measured, e.g. with a LiDAR system, or estimated, e.g. by providing the image data 14 to a depth estimation machine learning classifier.

Fig. 7 illustrates an example of a classification system 22. The classification system 22 comprises a machine learning classifier 24 with an input interface 26 for receiving an analysis data set 20 composed of image data 14 and projection data 18 encoding a deflection metric in an array, which reflects the structure of the pixel matrix in which the image data 14 is provided to the machine learning classifier 24. Machine learning classifiers 24 classify inputs according to an internal classification model based on previous training with training data. The model may be a support-vector machine or preferably an artificial neural network, which receives the analysis data set 20 for classification purposes. For image classification purposes, the machine learning classifier 24 preferably comprises a convolutional neural network, which comprises a plurality of artificial neurons arranged in layers. A convolutional neural network generally consists of an input layer, hidden layers and an output layer. In a convolutional neural network for applications in computer vision, the hidden layers usually include layers that perform convolutions, typically a layer that performs a dot product of a convolution kernel with the layer's input matrix. The convolution may be followed by other layers such as pooling layers, fully connected layers, and normalization layers. The respective operations may be associated with weights, which may be learned by the artificial neural network in a training process.
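The reconstruction of the azimuth angle β from the encoded zenith angle α, as discussed in conjunction with Fig. 6, can be sketched as follows; this is a minimal illustration using a plain 3 × 3 Sobel convolution as an example of the operator discussed with Fig. 3, and all names are hypothetical:

```python
import numpy as np

def sobel_gradient(alpha):
    """Approximate the local gradient of the deflection metric with 3x3
    Sobel kernels, applied as a plain convolution over the pixel grid."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = alpha.shape
    gx = np.zeros_like(alpha)
    gy = np.zeros_like(alpha)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = alpha[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = np.sum(patch * kx)
            gy[i, j] = np.sum(patch * ky)
    return gx, gy

# With α arranged in concentric circles around a principal point, the
# gradient points radially outward, so its direction yields the azimuth β.
u, v = np.meshgrid(np.arange(64), np.arange(64))
alpha = 0.01 * np.sqrt((u - 32.0)**2 + (v - 32.0)**2)
gx, gy = sobel_gradient(alpha)
beta = np.arctan2(gy, gx)
```

At a pixel directly to the right of the principal point the gradient direction, and hence β, is approximately zero; directly below it, β approximates π/2, which illustrates how the azimuth can be recovered from the single α channel.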
Artificial neural networks generally learn (or are trained) by processing examples, each of which may contain a known “input” and a known “result”, forming probability-weighted associations between the two. The training of a neural network from a given example may be conducted by determining the difference between the processed output of the network and a target output. The learning of the artificial neural network may be supervised, unsupervised, or reinforced, depending on the application as well as the availability of training data. For example, the artificial neural network may be provided with training data for which a desired output (e.g. a correct result) is already available (supervised learning) or can at least be attributed a measure of accuracy (reinforcement learning). In practice, the machine learning classifier 24 may be provided with a plurality of images, and the model may learn patterns by adjusting the weights of the internal operations depending on the quality of the output, e.g. based on a back-propagation algorithm. The images may be provided to the machine learning classifier 24 as image data 14 in a pixel matrix. The projection data 18 may be provided to the machine learning classifier 24 at the input layer, or may be provided to hidden layers of an artificial neural network as an additional input. In some embodiments, the projection data 18 is provided to specific layers of artificial neurons or to a second artificial neural network of the machine learning classifier 24, and outputs of said specific layers or said second artificial neural network are provided as inputs to hidden layers acting also on the image data 14, so as to provide the deflection metric to the convolutional neural network at an intermediate state of the classification task. For example, a convolutional neural network acting on the image data 14 may be provided in an encoder-decoder architecture, and appropriately sized maps of the projection data 18 may be provided as additional input, e.g.
at skip connections between the encoder and decoder layers of equal size. The machine learning classifier 24 may comprise a convolutional layer acting on the projection data 18, such that the machine learning classifier 24 may learn on its own to determine a measure indicative of the azimuth angle β at an appropriate position of the model, if such a measure improves the classification of training images. In other words, the machine learning classifier 24 may automatically learn an operator, such as the Sobel operators described above. Accordingly, classification tasks which are aimed at classifying objects in a three-dimensional scene may be enhanced based on the provision of a single data channel including the deflection metric. Since the deflection metric allows a convolutional neural network to reconstruct the azimuth angle β based on internal convolution operations, the network may also provide outputs related to the position of an object in the image or in the corresponding scene. Moreover, the deflection metric may generalize the projection information over different imaging systems, such that the resulting network can be applied to the classification of images of different origin. As the projection information is compressed into the deflection metric as a single data channel, the machine learning classifier 24 may feature lower computation time as well as faster convergence during training. The skilled person will appreciate that the structure of the image data 14 and the projection data 18 may be the same in embodiments, but that in principle the data 14, 18 may also be stored and provided in different file formats, e.g. with the individual values in different order, without changing the underlying structure.
Hence, the condition that the structure of the image data 14 and the projection data 18 reflect each other should merely be considered as a requirement that corresponding values of the image data 14 and the projection data 18 can be clearly associated with each other. It is noted that in computer vision, the same image may be provided to different layers of a convolutional neural network at different scales for classifying objects in said image. The skilled person will appreciate that the deflection metric may equally be rescaled or may be generated at the respective scales in an encoding stage. In some embodiments, the deflection metric may also be generated only for those pixels in the image data 14 that are processed by the machine learning classifier 24, e.g. when portions of the image data 14 are discarded for the purpose of object classification.
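Rescaling the deflection metric to the scales used by such a network can be sketched, for example, by repeated 2 × 2 average pooling (an illustrative sketch; the function name is hypothetical):

```python
import numpy as np

def pyramid_deflection(alpha, levels):
    """Down-sample the deflection channel by 2x2 average pooling per level,
    so a matching map of the metric exists at every scale of the pyramid."""
    maps = [alpha]
    for _ in range(levels - 1):
        a = maps[-1]
        h, w = a.shape
        a = a[:h - h % 2, :w - w % 2]  # crop to even size before pooling
        a = a.reshape(a.shape[0] // 2, 2, a.shape[1] // 2, 2).mean(axis=(1, 3))
        maps.append(a)
    return maps

scales = pyramid_deflection(np.arange(16.0).reshape(4, 4), 3)
# shapes: (4, 4), (2, 2), (1, 1)
```

Alternatively, the metric could be regenerated analytically at each scale in the encoding stage, which avoids pooling artifacts near the image borders.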
The skilled person will further appreciate that the intrinsic matrix K has been discussed with respect to an example of a linear 3 × 3 tensor. However, the intrinsic matrix may in principle be provided in the shape of a 3 × 4 tensor, e.g. as part of a camera matrix including a camera pose. The projection information may further comprise information on non-linear transformations, e.g. to reflect aberrations not covered by the pinhole camera model, which may be used in addition to the intrinsic matrix for determining the deflection metric.
For example, the projection model can be extended for non-linearity based on distortion polynomials such as the Brown-Conrady or Scaramuzza models. As a specific example, for the radial distortion modelling by Scaramuzza, a radial distortion can be modelled as a polynomial over the zenith angle α, with rᵢ as the coefficients:

r(α) = r₀ + r₁·α + r₂·α² + r₃·α³ + …
In other words, the method is not limited to the simple pinhole camera projection model, but can be extended to more complex projection models with non-linear distortions using methods known in the art.
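As an illustration of such a polynomial extension, a radial distortion polynomial over the zenith angle could be evaluated as follows; the coefficients shown are hypothetical placeholders, not calibrated values:

```python
import numpy as np

def radial_distortion(alpha, coeffs):
    """Evaluate a radial distortion polynomial r(alpha) = sum_i r_i * alpha**i
    over the zenith angle, in the spirit of the Scaramuzza model."""
    return sum(c * np.asarray(alpha)**i for i, c in enumerate(coeffs))

# Hypothetical coefficients; real values would come from camera calibration.
r = radial_distortion(0.5, [0.0, 1.0, 0.0, -0.1])
```

The distorted radius r(α) can then replace the linear pinhole relation when mapping the deflection metric onto the pixel grid.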
Fig. 8 illustrates a portion of a modified backbone meta-architecture of a machine learning classifier 24 as an example of a classification system 22. In the example of Fig. 8, a ResNet-FPN (Feature Pyramid Network) as the machine learning classifier 24 is modified on the basis of the projection data 18. The deflection metric in the projection data 18 is provided as a deflection image Iα, which can be injected into the model at the input and at selected locations of the machine learning classifier 24.
The deflection image Iα can be a one-channel image, in which every pixel u is aligned with the data image I. This can increase the information content of each pixel by the geometric sensor properties described by the data values of the deflection metric. The deflection image Iα can be processed together with the image data 14 by convolutional layers of a convolutional neural network as part of a machine learning classifier 24. Unlike the image data 14, the deflection image Iα may not be invariant to translation, rotation, and scale. However, since the convolutional layers of the convolutional neural network are learned, a machine learning classifier 24 can decide whether to use this additional information in the learning process. As shown in Fig. 8, at the input stage, the deflection image Iα may be concatenated with a three-channel image I as image data 14, resulting in an input shape of h × w × 4. The image input can be processed top-down in five down-sampling stages. Each stage may halve the height h and width w of the image data 14 and the projection data 18. This can be done for the image data 14 using strided convolutions, followed by a residual block (C1 to C5). The deflection image Iα can be down-sampled in parallel and concatenated to the features of the stages C1 to C5. In this way, the deflection metric can be used in every stage. After the injection, a 1 × 1 convolution may be used to fuse the features of the stage with the deflection metric. This allows the machine learning classifier 24 to keep or discard the deflection metric for a particular stage. The feature maps can be up-sampled from the bottom up prior to a fusion with the pristine feature maps from the respective stages. The fusion can be performed by a channel-wise concatenation of the feature maps and a subsequent 3 × 3 convolution for anti-aliasing, as with common FPN architectures.
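The input-stage injection described above, concatenating the one-channel deflection image with a three-channel image into an h × w × 4 input, can be sketched as follows (illustrative only; the function name and shapes are hypothetical):

```python
import numpy as np

def inject_deflection(image, alpha):
    """Concatenate a one-channel deflection image with a three-channel
    image, yielding the h x w x 4 input described for the modified FPN."""
    assert image.shape[:2] == alpha.shape  # pixel-wise alignment required
    return np.concatenate([image, alpha[..., None]], axis=-1)

# Hypothetical 32 x 32 example inputs.
h, w = 32, 32
image = np.random.rand(h, w, 3)   # three-channel data image I
alpha = np.random.rand(h, w)      # one-channel deflection image I_alpha
x = inject_deflection(image, alpha)
```

The same concatenation pattern applies at the stages C1 to C5, with the deflection image down-sampled to the respective feature-map size before injection.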
This results in the pyramid stages P2-P5 with the respective shapes (h/2ⁱ) × (w/2ⁱ) × 256 (i denoting the stage index). The pyramid stages P2-P5 can then be fed into a semantic segmentation head, e.g. as described in Tsung-Yi Lin et al. (Feature Pyramid Networks for Object Detection). In some examples, the deflection metric may be down-sampled in parallel with the image data 14 during a convolutional analysis stage, and the values of the deflection metric may be concatenated with the corresponding feature maps derived from the image data 14 at said stage. The deflection metric and the features derived from the image data 14 may be fused at different convolutional stages, and the resulting features may be passed to a decoder portion of the machine learning classifier 24. The machine learning classifier 24 may be trained based on a set of training data, which may be augmented to simulate different sensors. For example, when the training data comes from a single sensor or a low number of sensors, the data may be augmented, e.g. using resize and/or center-crop operations, in order to generate additional training data and simulate novel sensors during training. “Center-crop” changes the field of view and “resize” changes the resolution of a sensor. The combination of both may allow the simulation of various sensors during training. The inventors found in their experiments that machine learning models may be biased on the sensor resolution used during training, such that images at other resolutions may be classified with lower accuracy. Using augmented training data to train the machine learning classifier 24 with simulated additional sensors, the bias could be significantly reduced. Additionally including the projection data 18, e.g. as shown in the example of Fig. 8, can further reduce the bias on the sensor resolution of the sensor used for obtaining the training images, and may increase the performance of semantic segmentation, e.g.
the mean intersection over union (mIoU) as an evaluation metric of the performance of semantic segmentation.

The description of the preferred embodiments and the figures merely serve to illustrate the invention and the beneficial effects associated therewith, but should not be understood to imply any limitation. The scope of the invention is to be determined solely by the appended claims.