Title:
PARALLAX AMONG CORRESPONDING ECHOES
Document Type and Number:
WIPO Patent Application WO/2024/062365
Kind Code:
A1
Abstract:
The present invention concerns a method for performing a 3-dimensional scene reconstruction by using a sensor system comprising at least one emitter, and at least three receiving sensors. The method comprises the steps of: receiving (11) a data frame associated with a sensor channel in response to a pulse being reflected from an object to be geometrically reconstructed; pre-processing (13) the data frame; detecting (15) an echo from the pre-processed data frame; extracting (17) a set of features from the detected echo; comparing (19) the extracted set of features in a given time window to corresponding sets of features from other sensor channels to obtain a respective echo correspondence score between the extracted set of features and a respective set of features of a respective echo of a respective other sensor channel; constructing (21) a respective ellipsoid for the detected echo and the respective echo for which the echo correspondence score indicates the best echo correspondence in the respective other sensor channel in the given time window to obtain a set of ellipsoids; and determining (21) an intersection point of the ellipsoids in the set of ellipsoids to perform 3D scene reconstruction.

Inventors:
HAHNE CHRISTOPHER (CH)
SZNITMAN RAPHAEL (CH)
Application Number:
PCT/IB2023/059225
Publication Date:
March 28, 2024
Filing Date:
September 18, 2023
Assignee:
UNIV BERN (CH)
International Classes:
G01S13/10; G01S7/41; G01S7/48; G01S7/539; G01S7/54; G01S13/87; G01S15/10; G01S15/87; G01S17/10; G01S17/87
Foreign References:
US20180074177A1, 2018-03-15
GB2327266A, 1999-01-20
US10698094B2, 2020-06-30
Other References:
K. K. PARIDA, S. SRIVASTAVA, G. SHARMA: "Beyond image to depth: Improving depth prediction using echoes", PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2021, pages 8268 - 8277
C. HAHNE: "Multimodal exponentially modified gaussian oscillators", IEEE INTERNATIONAL ULTRASONICS SYMPOSIUM (IUS), 2022, pages 1 - 4, XP034239133, DOI: 10.1109/IUS54386.2022.9958253
C.-W. JUAN, J.-S. HU: "Object localization and tracking system using multiple ultrasonic sensors with Newton-Raphson optimization and Kalman filtering techniques", APPLIED SCIENCES, vol. 11, no. 23, 2021
"ECHO ONE: 3D ultrasonic echolocation and ranging sensor", TOPOSENS GMBH, July 2022 (2022-07-01)
T. PADOIS, O. DOUTRES, F. SGARD: "On the use of modified phase transform weighting functions for acoustic imaging with the generalized cross correlation", THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, vol. 145, no. 3, 2019, pages 1546 - 1555, XP012236522, DOI: 10.1121/1.5094419
Attorney, Agent or Firm:
LUMI IP LLC (CH)
Claims:
CLAIMS

1. A method for performing a 3-dimensional scene reconstruction by using a sensor system (1) comprising at least one emitter (5), and at least three sensors (3), the method comprising:

• receiving (11) a data frame associated with a sensor channel in response to a pulse being reflected from an object (6) to be localised and/or geometrically reconstructed;

• pre-processing (13) the data frame to facilitate subsequent processing of the respective data frame;

• detecting (15) an echo and a time of arrival of the echo from the pre-processed data frame;

• extracting (17) a set of features from the detected echo;

• comparing (19) the extracted set of features in a given time window to corresponding sets of features from other sensor channels to obtain a respective echo correspondence score between the extracted set of features and a respective set of features of a respective echo of a respective other sensor channel, the respective echo correspondence score indicating directly or indirectly the likelihood that the respective echoes are reflected from the same object;

• mathematically constructing (21) a respective ellipsoid for the detected echo and the respective echo for which the echo correspondence score indicates the best echo correspondence in the respective other sensor channel in the given time window to obtain a set of ellipsoids; and

• determining (21) an intersection point of the ellipsoids in the set of ellipsoids to perform 3-dimensional scene reconstruction.

2. A method according to claim 1, wherein a first radius of the respective ellipsoid is obtained from a respective time of arrival of the respective echo for which the respective ellipsoid is constructed, and a second radius and a third radius of the respective ellipsoid are obtained from the respective time of arrival of the respective echo for which the respective ellipsoid is constructed, and the location of the emitter (5) and the respective sensor (3).

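3. A method according to claim 2, wherein the major axis of the respective ellipsoid equals $a_n = c_s t_n^{\star} / 2$ and the minor axes $b_n = c_n = \tfrac{1}{2}\sqrt{(c_s t_n^{\star})^2 - \|\mathbf{u} - \mathbf{v}_n\|_2^2}$, where $\mathbf{u}$ is the emitter position at a first focal point of the respective ellipsoid, $\mathbf{v}_n$ is the respective sensor position at a second focal point of the respective ellipsoid, $t_n^{\star}$ denotes the time of arrival of the respective echo, and $c_s$ is the propagation velocity.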

4. A method according to any one of the preceding claims, wherein the set of ellipsoids comprises at least three ellipsoids.

5. A method according to any one of the preceding claims, wherein the features in the extracted set of features comprise any of the following features relating to the respective echo: amplitude, power, time of arrival and shape as described by the mean, and/or spread, and/or skew when the echo is modelled as a Gaussian distribution, as well as frequency and/or phase for oscillation modelling of the echo.

6. A method according to any one of the preceding claims, wherein the respective echo correspondence score is obtained as a distance metric computed feature-wise between the extracted set of features and the respective set of features of the respective echo of the respective other sensor channel.

7. A method according to any one of the preceding claims, wherein the set of features is extracted by using an energy-optimisation-based method and/or by using one or more convolutional neural networks.

8. A method according to claim 7, wherein the energy-optimisation-based method involves using a multimodal exponentially-modified Gaussian model, and wherein the multimodal exponentially-modified Gaussian model comprises one or more oscillation terms.

9. A method according to claim 8, wherein the echoes are represented as asymmetric Gaussians, and wherein the set of features are extracted in a three-stage non-linear least-squares regression process, wherein in a first stage, exponentially-modified Gaussian parameter regression of amplitude, power, time of arrival, and shape of the respective echo is performed, in a second stage, the amplitude, power, time of arrival, and shape of the respective echo are used to optimise oscillation parameters of frequency and phase of the respective echo, and in a third stage, joint parameter minimisation of the amplitude, power, time of arrival, shape, frequency, and phase of the respective echo is performed.

10. A method according to any one of the preceding claims, wherein the comparison of the extracted set of features is carried out in a multi-layer perceptron network (35), and wherein the multi-layer perceptron network is trained with one or more loss terms through a backpropagation arrangement.

11. A method according to any one of the preceding claims, wherein the detected echo is a reference echo fulfilling one or more criteria for serving as a reference echo, wherein the one or more criteria are derived from echo properties including at least one or more of the following: amplitude, power, time of arrival, frequency, phase, mean, spread, and skew for the asymmetric Gaussian modelling of the echo.

12. A method according to any one of the preceding claims, wherein the method further comprises assessing the reliability of the echo correspondence, and localising the object for the data frame only if the assessed reliability is above a given threshold value for at least a given number of sensor channels.

13. A method according to any one of the preceding claims, wherein the pre-processing of the data frame comprises at least one of the following operations: spectral filtering of the data frame, power loss compensation of the data frame, and Hilbert transformation of the data frame.

14. A method according to any one of the preceding claims, wherein the sensors are arranged in a symmetric configuration or in an asymmetric configuration, and/or wherein at least two of the sensors (3) operate in mutually different frequency bands.

15. A sensor system (1) for performing a 3-dimensional scene reconstruction, the sensor system (1) comprising at least one emitter (5), and at least three sensors (3), the sensor system (1) comprising means for:

• receiving in a sensor channel a data frame in response to a pulse being reflected from an object (6) to be localised and/or geometrically reconstructed;

• pre-processing the data frame to facilitate subsequent processing of the respective data frame;

• detecting an echo and a time of arrival of the echo from the pre-processed data frame;

• extracting a set of features from the detected echo;

• comparing the extracted set of features in a given time window to corresponding sets of features from other sensor channels to obtain a respective echo correspondence score between the extracted set of features and a respective set of features of a respective echo of a respective other sensor channel, the respective echo correspondence score indicating directly or indirectly the likelihood that the respective echoes are reflected from the same object;

• mathematically constructing a respective ellipsoid for the detected echo and the respective echo for which the echo correspondence score indicates the best echo correspondence in the respective other sensor channel in the given time window to obtain a set of ellipsoids; and

• determining an intersection point of the ellipsoids in the set of ellipsoids to perform 3-dimensional scene reconstruction.

Description:
PARALLAX AMONG CORRESPONDING ECHOES

FIELD OF THE INVENTION

The present invention relates to a method for performing a 3-dimensional (3D) scene reconstruction. More specifically, the method uses a sensor set-up together with an echo correspondence determination process. The present invention also relates to a sensor system for implementing the steps of the method.

BACKGROUND OF THE INVENTION

Certain animal species that move in 3D space are able to perceive surroundings by pulses emitted and bounced off from obstacles. This ability has motivated many computer vision applications. Over the course of recent decades, 3D computer vision has been dominated by stereoscopic parallax and time-of-flight (ToF) sensing. Depth perception in the light spectrum is a well-studied topic and adopted by simultaneous localisation and mapping (SLAM) to help robots navigate. However, varying weather (e.g., fog, rain, etc.) or severe lighting conditions remain challenging environments for computational imaging. Only little attention has recently been given to alternative 3D reconstruction models. To this end, the present invention establishes parallax among corresponding echoes (PaCE) as a depth sensing hybrid inheriting triangulation and ToF concepts at a geometric level.

Early research on sonar-based echolocation for mobile robots retrieved object points in two-dimensional (2D) space by means of ellipse intersections. Furthermore, US10698094B2, for instance, discloses a localisation method based on 2D ellipse intersections. Several studies attempted to mimic a bat’s acoustic perception by modelling ears from two receivers and inferring obstacle locations using spherical coordinates. A recent breakthrough in 3D reconstruction from audible acoustics is based on deep learning architectures trained by camera-based depth maps without physical modelling, as disclosed for example in a publication by K. K. Parida, S. Srivastava, and G. Sharma, “Beyond image to depth: Improving depth prediction using echoes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8268-8277.

The existing literature and inventions have so far disregarded the potential of 3D echolocation from ellipsoidal intersection and a preceding echo association. Here, echo association refers to the correct assignment of echoes to actual targets. As opposed to phased arrays where transducers are separated by a multiple of the wavelength and thus have specific locations, the proposed echo correspondence and geometric localisation model of the present invention fills this gap in 3D sonar tracking by enabling relaxation of sensor position constraints. The present invention enables the fusion of arrival times from either arbitrarily placed or even freely moving time-of-flight (ToF) detectors to gain new flexibility on the hardware and application side. In particular, the novel ellipsoid intersection model proposed by the present invention is considered the first solution for 3D scene reconstruction from at least three detectors and at least one emitter without constraints on their positions in space. It is to be noted that in the present description, the notation 3D scene reconstruction also covers 3D object localisation.

US2018/0074177A1 discloses a device and method for determining the 3D position of an object. The device comprises at least one transmitter that is adapted to emit a signal; at least three receivers, wherein the at least three receivers and the at least one transmitter are preferably arranged within a first plane, wherein a first receiver and a second receiver are preferably arranged along a first straight line, and a third receiver is preferably arranged at a distance from the first straight line; and a processor that is configured to determine at least three propagation times, wherein the respective propagation time is a time required by the signal from the transmitter via the object to the respective receiver. The processor is further configured to determine the 3D position of the object on the basis of the determined propagation times as well as on the basis of the arrangement of the transmitter and the receivers.

GB2327266A relates to an acoustic location system comprising an acoustic signal generator for generating acoustic signals, a plurality of acoustic sensors which operate to detect acoustic signals reflected by a target body and to generate data signals representative thereof. The data signals are fed to a data processor which is coupled to the acoustic sensors by connectors. The data processor identifies valid data by associating signals from said sensors according to their angle of arrival, amplitude and timing, before determining the location of the target body from an intersection of loci. The loci are representative of hypothetical locations of the target body determined in accordance with a time of flight of the acoustic signals from the acoustic signal generator to the acoustic signals reflected via the target body.

BRIEF DESCRIPTION OF THE INVENTION

The present invention aims to overcome at least some of the above-identified problems. More specifically, to overcome the above problems, the present invention proposes a novel method and a corresponding measurement or sensor system for carrying out a 3D scene reconstruction.

According to a first aspect of the invention, a method for performing a 3D scene reconstruction is provided as recited in claim 1.

The method optionally comprises positioning the at least three sensors arbitrarily in three-dimensional space.

The proposed invention relies on a novel echo correspondence determination process, which in some implementations may be a phase-invariant process. This leads to relaxation of sensor position constraints compared with sensor set-ups of state-of-the-art solutions. This enables non-static detector arrangements provided that the detector positions are known at the time of the measurement. Echo correspondence determination may be implemented by using algorithms, such as non-linear least-squares optimisation and/or neural networks. If the teachings of the present invention are used for 3D object localisation, then the proposed solution is also more accurate than phase-correlated 3D object localisations. Furthermore, lower sample rates are needed compared with sample rates used in phase-correlation methods. The proposed solution requires fewer sensors than phased arrays (lower hardware costs). Moreover, the application fields of the proposed invention are vast. For instance, the proposed invention can be applied to light (light-emitting diodes (LEDs) and photodiodes) or electromagnetic spectra.

According to a second aspect of the invention, a sensor system is provided, which is configured to implement the method according to the first aspect of the present invention.

Other aspects of the invention are recited in the dependent claims attached hereto.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will now be described in more detail with reference to the attached drawings, in which:

• Figure 1 schematically illustrates a sensor system that can be used to implement the teachings of the present invention;

• Figure 2 shows a flowchart illustrating the 3D scene reconstruction method according to an example of the present invention;

• Figure 3 illustrates a cross-sectional geometric model of two ellipsoids showing one emitter and two receiving sensors;

• Figure 4 shows an echo correspondence architecture to determine mutually matching echoes across a plurality of sensor channels;

• Figure 5 shows an overview of a pre-processing and feature extraction pipeline according to an example of the present invention;

• Figure 6 illustrates visually the echo correspondence determination used in the proposed 3D scene reconstruction method according to an example of the present invention;

• Figure 7 shows Table I illustrating experimental 3D localisation results;

• Figure 8 shows experimental object localisation results showing 18 position estimates s* in the [-80 mm, 0 mm, 80 mm] interval of the (xy)-plane and [100 mm, 180 mm] interval in z-direction. The upper left diagram shows s* in 3D, whereas adjacent plots depict 2D projections of the same. Greyscale tones illustrate the individual root mean square error (RMSE) while dashed circles represent the mean RMSE;

• Figure 9 shows echo correspondence in A-Scan frames $y_n(t_i)$ at ground-truth (GT) position s = [80.0 mm, 0.0 mm, 180.0 mm] captured by N = 4 sensors with echo association highlighted in grey as hatched exponentially-modified Gaussian (EMG) components and frame confidence $C_n$ as defined in C. Hahne, “Multimodal exponentially modified gaussian oscillators,” in 2022 IEEE International Ultrasonics Symposium (IUS), 2022, pp. 1-4. See the resolved echo ambiguity in $y_1(t_i)$; and

• Figure 10 shows Table II illustrating ablation overview for echo association.

DETAILED DESCRIPTION OF THE INVENTION

It should be noted that the figures are provided merely as an aid to understanding the principles underlying the invention, and should not be taken as limiting the scope of protection sought. Where the same reference numbers are used in different figures, these are intended to indicate similar or corresponding features. It should not be assumed, however, that the use of different reference numbers is intended to indicate any particular degree of difference between the features to which they refer. As used herein, unless otherwise specified, the use of the ordinal adjectives “first”, “second”, “third”, etc. to describe a common object merely indicates that different instances of like or different objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner. Bold fonts are used to describe vectors (lower-case letter) or matrices (upper-case letter), where the vectors are to be understood as a finite sequence of numbers of a fixed length.

Some definitions that help understand the teachings of the present invention are given next.

A sensor channel or sensor continuously records the amplitude of energy packets that physically arrive on the sensing surface at different points in time. A sensor is generally only responsive to a certain light, electromagnetic or sonic wave spectrum. Depending on the implementation, the sensor according to the present invention may for instance be a photon-sensitive ToF detector, a radar antenna or an ultrasound transducer.

A data frame, or simply frame, is characterised by a set of sensor amplitudes recorded consecutively at a certain time interval. In time-of-flight applications, this time interval typically starts with the emittance of a pulse and ends after a defined time cycle before it starts again to capture a subsequent frame. This time interval, i.e., the frame length, determines the maximum distance to be measured in a time-of-flight application.
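The relation between frame length and maximum measurable distance can be illustrated with a minimal sketch. It assumes a co-located emitter and sensor (so the pulse travels the distance twice) and a propagation velocity of 343 m/s for sound in air; the function name `max_range` and both default values are illustrative assumptions, not part of the described method.

```python
def max_range(frame_length_s: float, c_s: float = 343.0) -> float:
    """Largest one-way distance whose echo still arrives within one frame.

    Assumes emitter and sensor share one position, so the pulse covers the
    distance twice (outward and return) during the frame.
    c_s is the propagation velocity in m/s (343 m/s is an assumed value for air).
    """
    return c_s * frame_length_s / 2.0


if __name__ == "__main__":
    # Example: a frame length of 25 ms.
    print(f"{max_range(0.025):.4f} m")
```

A longer frame extends the range at the cost of a lower frame rate, which is the trade-off implied by the definition above.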

A pulse is an energy signal emitted by an active depth sensing device (light-emitting diode (LED), laser, antenna, transducer, etc.) into a certain direction in 3D space. It can be emitted to several directions simultaneously forming a solid angle. The purpose of the pulse is to hit surrounding objects and be bounced back to the sensor system as an echo.

An echo is a portion of the emitted energy pulse reflected at an object’s surface and travelled back towards the sensor device. In signal processing, an echo is represented by multiple consecutive sensor amplitude samples. The time that passed from pulse emittance until an echo arrives at the sensor is known as time-of-arrival.

Figure 1 illustrates an example sensor system 1 that may be used to implement the teachings of the present invention. The system comprises a first receiver 3 (i.e., a receiving sensor or transducer), a second receiver 3, a third receiver 3, and a first transmitter or emitter 5 (i.e., a transmitting sensor or transducer) configured to wirelessly emit a pulse or a sequence of pulses, also known as bursts. The emitted pulses then hit a target or object 6, where the pulses are reflected to be subsequently wirelessly received at, or captured by, the receivers 3, which then detect echoes from the received data frames. In this example, the three receivers form an equilateral triangle, with the transmitter at the centre of the triangle. Thus, the sensors form in this example a symmetric arrangement or configuration. However, any other sensor arrangement is possible, as long as the number of sensors is at least three, and there is at least one emitter. According to one variant, one of the sensors would also operate as a transmitter to thereby form a transceiver. Thus, the system includes at least three sensors with at least three being able to receive signals and at least one being capable of transmitting/emitting pulse-encoded signals. The system of Figure 1 also comprises a data processing unit 7, which is configured to be in data communication (wirelessly or through a wired connection) with the sensors and optionally with the emitter. The data processing unit may be a cloud computing unit, or it could be embedded into the sensor device, for instance. The data processing unit may implement various data processing tasks as explained later in more detail. Thus, if the sensors and the data processing unit are separate elements, then the sensors may for example simply carry out data acquisition but the other method steps as described later could be implemented by the data processing unit. However, other divisions of the tasks are equally possible.

The sensors 3 may be for instance piezoelectric transducers, antenna elements or photodiodes, arranged as single elements or consistently spaced arrays of these elements. The emitter 5 may be a piezoelectric transducer, a laser source, a light-emitting diode (LED) or a conductive electrical element. The emission directivity of the emitter 5 is ideally isotropic, but it may have a given solid angle suitable for the region of interest. The angle-of-view (AoV) of the sensors is ideally a hemisphere with isotropic sensitivity, but they may have a smaller solid angle suitable for the region of interest. The sensor positions, which form intersecting ellipsoids as described later, may be arbitrary (i.e., static or non-static). The sensors do not have to be co-planar, i.e., they do not have to lie in the same plane. Furthermore, the sensors do not need to be separated by a wavelength of the transmitted pulse or by its multiple (due to phase-invariance). Advantageously, the sensor arrangement for static setups is a point-symmetric arrangement (with symmetric localisation deviation in mind), where the emitter 5 has an equal distance to each sensor 3. In other words, the emitter would in this arrangement be at the centroid (point-symmetry) of the sensor setup (but in other examples, the emitter would be placed outside of the centroid). In the point-symmetric setup, the receivers would form an equilateral triangle (with three receivers), a square (with four receivers), a pentagon (with five receivers), a hexagon (with six receivers), a heptagon (with seven receivers), etc. It is to be noted that the sensors may be of different types such that for example a camera can be combined for instance with at least two ultrasound transducers or at least two photodiodes with a corresponding transmitter. In other words, at least two of the sensors can be arranged to operate in mutually different frequency bands.

The proposed method is next illustrated with reference to the flowchart of Figure 2. The main inventive contributions of the present invention are then explained later in more detail. In step 11, the sensors 3 capture the data frames from the target. In other words, in this step synchronous sensor data acquisition is carried out to record target reflections. Thus, the system uses a master clock to allow the sensors to operate at identical or substantially identical rates. This means that time scales of the emitter and the sensors are aligned. As in this particular example the remaining steps are carried out by the data processing unit 7, step 11 also involves the sensors 3 sending the acquired data to the data processing unit. It is to be noted that step 11 may be preceded by a step of positioning the sensors 3 arbitrarily in 3D space, i.e., placing the sensors in arbitrary locations or positions in 3D space.

In step 13, the data processing unit 7 carries out signal pre-processing. The objective of the signal pre-processing is to facilitate subsequent processing tasks such as time-of-arrival detection and echo feature extraction. The pre-processing may comprise, but is not limited to, frequency-domain filtering such as band-pass filtering, compensation of the power loss that results from the inverse square law, and/or Hilbert transformation of the acquired raw signal channel data.
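The three pre-processing operations named in step 13 can be sketched as a small NumPy pipeline. This is only an illustration under stated assumptions: the brick-wall FFT band-pass, the quadratic gain law used for power loss compensation, and all function names (`bandpass`, `compensate_power_loss`, `envelope`, `preprocess`) are hypothetical choices, not the filters actually used by the described system.

```python
import numpy as np


def bandpass(frame: np.ndarray, fs: float, f_lo: float, f_hi: float) -> np.ndarray:
    """Crude brick-wall band-pass: zero every FFT bin outside [f_lo, f_hi]."""
    spectrum = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(frame.size, d=1.0 / fs)
    spectrum[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(spectrum, n=frame.size)


def compensate_power_loss(frame: np.ndarray, fs: float, exponent: float = 2.0) -> np.ndarray:
    """Apply a time-dependent gain so that late (distant) echoes are boosted.

    The gain (t / t_max) ** exponent is an assumed stand-in for an
    inverse-square-law compensation; the exponent is a tunable assumption.
    """
    t = np.arange(frame.size) / fs
    return frame * (t / t[-1]) ** exponent


def envelope(frame: np.ndarray) -> np.ndarray:
    """Magnitude of the analytic signal, i.e. the Hilbert-transform envelope."""
    n = frame.size
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    # Doubling positive frequencies and dropping negative ones yields the analytic signal.
    return np.abs(np.fft.ifft(np.fft.fft(frame) * weights))


def preprocess(frame: np.ndarray, fs: float, f_lo: float, f_hi: float) -> np.ndarray:
    """Band-pass filter, compensate the power loss, then take the envelope."""
    filtered = bandpass(np.asarray(frame, dtype=float), fs, f_lo, f_hi)
    return envelope(compensate_power_loss(filtered, fs))
```

The order of operations is a design choice made here for illustration: filtering first removes offsets and out-of-band noise before the gain amplifies the tail of the frame.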

In step 15, the data processing unit carries out echo detection across a given set of receivers 3 by using time-of-flight echo detection techniques or other suitable techniques. Echo detection in time-of-flight applications is a well-studied subject with a broad variety of solutions ranging from electronic circuitries to digital processing with algorithms employing thresholds, gradients, the Fisher matrix, dictionary-based methods and more recently convolutional neural networks.

The association of echoes that belong to a distinct target is an important step according to the present invention, which has not been satisfyingly addressed in prior work. It is important to note that matching echoes from different targets yields false positive target positions making this step an important aspect of the present invention. To address this echo correspondence problem, features are extracted in step 17 from each detected echo and compared in step 19 against echoes from other sensor channels.

Feature extraction may be accomplished using various methods, such as energy-based optimisation methods (e.g., multimodal exponentially-modified Gaussian oscillators) or convolutional neural networks (e.g., ResNet). Once echo features are established, the features are fed as vectors into the echo matching process, which analyses the likelihood of echoes from different sensor channels belonging together.

This echo correspondence likelihood is quantitatively obtained by distance metrics and expressed as a so-called dissimilarity (or similarity) score. In the simplest form, such a metric is represented by the Euclidean distance. A more advanced concept comprises networks with identical neural network architectures and shared weights for each sensor channel (often referred to as Siamese networks). To train such a dissimilarity network for desired scores, a variety of losses can be employed with the contrastive loss being among the preferred ones. Step 19 may also involve selecting one of the echoes across different sensor channels as a reference echo. The reference echo fulfils one or more criteria for serving as a reference echo, wherein the one or more criteria are derived from echo properties including at least one or more of the following: amplitude, power, frequency, arrival time and signal shape in the form of the width, mean and/or skew in the case of Gaussian modelling. The reference echo may be the strongest echo (or one of the strongest echoes) among the different sensor channels in a given time window. Steps 17 and 19 can be considered to collectively form an echo correspondence determination process.
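The simplest variant described above, a Euclidean dissimilarity score combined with a strongest-echo reference, can be sketched as follows. The feature layout (amplitude in column 0, all features already normalised to comparable scales) and the function name `match_echoes` are assumptions made for this illustration; a trained Siamese network would replace the distance computation.

```python
import numpy as np


def match_echoes(channels):
    """Match echoes across sensor channels by Euclidean feature distance.

    channels: list of arrays, one per sensor channel, each of shape (K_n, F),
    holding one feature vector per detected echo in the current time window.
    Column 0 is assumed to be the echo amplitude (hypothetical layout).
    Returns the reference channel index and a dict
    {channel index: (best echo index, dissimilarity score)}.
    """
    # Reference echo: the strongest echo across all channels.
    ref_n, ref_k = max(
        ((n, k) for n, feats in enumerate(channels) for k in range(len(feats))),
        key=lambda nk: channels[nk[0]][nk[1], 0],
    )
    ref = channels[ref_n][ref_k]

    matches = {}
    for n, feats in enumerate(channels):
        if n == ref_n:
            matches[n] = (ref_k, 0.0)
            continue
        # Lower score means more similar, i.e. more likely the same object.
        scores = np.linalg.norm(feats - ref, axis=1)
        k = int(np.argmin(scores))
        matches[n] = (k, float(scores[k]))
    return ref_n, matches
```

The time of arrival is deliberately not part of the assumed feature vector here, since it legitimately differs between channels for the same object; whether and how to include it is a design decision.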

In step 21, ellipsoids are mathematically constructed for the respective echoes. The ellipsoid generation process may be implemented so that a respective ellipsoid is mathematically constructed for the reference echo (in its reference sensor channel) and for each best-matching echo per sensor channel in a given time window. As an example, the length of this time window may be 500 milliseconds, 200 milliseconds, or more specifically 100 milliseconds or even 50 milliseconds and/or it may vary in length. Furthermore, in this example the time window length is decoupled from the measurement or pulse emission cycle. For instance, one may use a time window of 25 milliseconds, but only emit pulses every 50 milliseconds.

In step 23, the location of the target is determined by using an ellipsoid intersection model, which may be understood to be based on a zero-crossing method. A three-dimensional target position is represented by the intersection of geometric ellipsoids that have their foci located at the emitter and sensor positions and that are spanned by the detected times of arrival (ToAs) from corresponding sensor echo pairs. The precise intersection point retrieval can be obtained by conventional gradient methods, such as gradient descent, conjugate gradient, Gauss-Newton, or Levenberg-Marquardt. The ellipsoid for each receiver is mathematically constructed within a given time window as is explained next in more detail.

In classical one-dimensional (1D) range finding, the transmitting and receiving transducers are identical, which implies that the outward and return travel paths coincide. In this special case, distances from a target to a sensor can be readily obtained by

$$d_n = \frac{c_s}{2}\, t_n^{\star}, \qquad t_n^{\star} \in \left\{ t_i \;\middle|\; \nabla_t \mathcal{H}\left[ y_n(t_i) \right] > \tau \right\}, \tag{1}$$

where $y_n(t_i)$ denotes the captured amplitude data from sensor $n$, $t_n^{\star}$ is a detected time of arrival, $\mathcal{H}[\cdot]$ is the Hilbert transform, $c_s$ is the propagation velocity and the divider accounts for the equidistant forward and backward travel paths. In the present description, index $n$ is the same index for several vectors as it represents the sensor number or respective ellipsoid, frame, channel, etc. It is to be noted that Equation 1 merely describes one way of obtaining echoes, but other ways of detecting echoes may be equally used in the present invention. Each time sample $t_i$, $i \in \{1, \dots, T\}$, with a total number $T$ of samples, qualifies to be a detected echo once the gradient of the respective Hilbert-transformed amplitude signal surpasses a threshold $\tau$. Note that such a single-sensor setup generally yields radial distances, making accurate directional information retrieval of the surrounding targets intractable. To facilitate subsequent signal processing tasks, a background capture $y_b(t_i)$ is deducted from each acquired frame $y_r(t_i)$ such that $y(t_i) = y_r(t_i) - y_b(t_i)$.
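The gradient-threshold detection and the 1D range conversion described in this paragraph can be sketched as follows. The sketch assumes the input is already a Hilbert-transformed envelope, reports only the first sample of each run above the threshold as the time of arrival, and uses 343 m/s as an assumed propagation velocity; the function names `detect_toas` and `range_1d` are hypothetical.

```python
import numpy as np


def detect_toas(envelope: np.ndarray, fs: float, tau: float) -> np.ndarray:
    """Return times (in seconds) where the envelope gradient first exceeds tau.

    envelope: Hilbert-transformed amplitude signal of one frame.
    fs: sample rate in Hz; tau: gradient threshold in amplitude units per second.
    Each contiguous run of samples above the threshold counts as one echo, and
    its first sample is reported, so the value precedes the envelope peak.
    """
    grad = np.gradient(envelope) * fs  # finite-difference slope per second
    above = grad > tau
    previous = np.concatenate(([False], above[:-1]))
    onsets = np.flatnonzero(above & ~previous)
    return onsets / fs


def range_1d(toa: float, c_s: float = 343.0) -> float:
    """Distance for a co-located emitter and sensor: half the travelled path."""
    return c_s * toa / 2.0


def subtract_background(raw: np.ndarray, background: np.ndarray) -> np.ndarray:
    """Deduct a background capture from an acquired frame."""
    return np.asarray(raw, dtype=float) - np.asarray(background, dtype=float)
```

Raising `tau` suppresses weak echoes and noise at the cost of later detection on the rising edge, so the threshold would typically be tuned per sensor.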

Pinpointing a 3D target landmark generally involves advanced geometric modelling. By using a receiver and a transmitter in a so-called round-trip setup, ToA detection still yields the total travel time, whereas the outward and return travel paths may be non-equidistant (i.e., Equation 1 does not hold). This is because a distinct receiver position causes the travel direction to change after target reflection, spanning a triangle between the transmitter, the receiver and a potential target position (see Figure 3). This triangle has its roots in the parallax concept, where an object point is observed from at least two different viewpoints. The vector between u and v_1 can be regarded as the baseline and, while this is given, the triangle’s side lengths (i.e., travel paths) remain unknown for the single-receiver round-trip case. Here, all travel path candidates form triangles with equal perimeter fixed at the baseline vector. Closer inspection of Figure 3 reveals that feasible object positions yield an infinite set of solutions located on an ellipse in a 2D plane and, when extended to 3D, this solution set is represented by an ellipsoid. The surface of an ellipsoid thus reflects potential target locations in a 3D round-trip scenario comprising one transmitter u and one receiver v_1. Adding a second receiver to capture a ToA from the same target spans a second ellipsoid that intersects with the first ellipsoid along a curve, which carries a subset of points from the solution set and thus narrows down the solution candidates. Only by introducing a third receiver and its respective ellipsoid can the target’s 3D location ambiguity be resolved, at a point where this solution curve and the third ellipsoid intersect. It is mathematically demonstrated hereafter that a group of echoes reflected from the same object and captured by N ≥ 3 sensors enables retrieval of the target position that resides on N ellipsoid surfaces.
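The round-trip geometry can be checked numerically with a short sketch. It assumes the standard two-focus construction (semi-major axis equal to half the travelled path, semi-minor axes from the focal distance); the function name, the sensor coordinates and the propagation velocity of 343 m/s are illustrative assumptions.

```python
import math

def spheroid_semi_axes(u, v, toa, c_s=343.0):
    """Semi-axes of the ellipsoid with foci at transmitter u and receiver v."""
    a = c_s * toa / 2.0                    # semi-major axis: half the round-trip path
    focal = math.dist(u, v) / 2.0          # half the baseline length
    b = math.sqrt(a * a - focal * focal)   # two equal semi-minor axes
    return a, b, b

# Hypothetical setup: baseline along x, centred on the origin, so that the
# axis-aligned surface equation applies without rotation or translation.
u, v1 = (-0.05, 0.0, 0.0), (0.05, 0.0, 0.0)
target = (0.02, 0.10, 0.30)
toa = (math.dist(u, target) + math.dist(target, v1)) / 343.0

a, b, c = spheroid_semi_axes(u, v1, toa)
x, y, z = target
surface_value = (x / a) ** 2 + (y / b) ** 2 + (z / c) ** 2   # equals 1 on the surface
```

Any other point with the same total path length from u via the point to v_1 gives the same surface value, which is why a single receiver cannot resolve the target position.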

Let any point lie on the surface of ellipsoid n if and only if the ellipsoid function of Equation 2 vanishes at that point, this function being parameterised by a radii vector r_n. This radii vector is constructed from a major axis and two minor axes, with u as the transmitter position and v_n as the receiver positions at the focal points of the ellipsoids. In fact, the two equal minor axes indicate that transducer pairs span spheroids, which are prolate because the focal points lie on the major axis. These surface definitions are valid for ellipsoids whose centre resides at the coordinate origin and whose axes r_n are in line with the coordinate axes. In the general case, the displaced ellipsoidal surface may be arbitrarily oriented and positioned and consists of space coordinates for which a centre and a rotation matrix are introduced (Equation 5), where it is possible to make use of orthogonality as a rotation matrix property. According to the aforementioned geometric definitions, a potential target ideally resides on the surface of at least three ellipsoid bodies. In a mathematical sense, this statement holds true for a point s* at which the ellipsoid functions of all ellipsoids vanish, which is expressed by plugging Equation 5 into Equation 2. Hence, solving for s* breaks down to classical root-finding such that the employment of a multivariate gradient descent (GD) method is sufficient here. The GD update at iteration j with step size γ is formed from the ellipsoid function vector, which collects the ellipsoid functions of the N ellipsoids, and from its Jacobian with respect to the position estimate. The Jacobian is composed of the partial derivatives with respect to x, y and z, which are computed for each iterative step j until convergence. The estimated location point is then selected by considering N ellipsoid surfaces.
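A gradient-descent intersection of this kind can be sketched as below. Note the assumption: instead of the rotated quadratic surface form, the sketch uses the equivalent focal-distance form of each ellipsoid (path from u via the point to v_n minus the measured path length), which has the same zero set. Sensor positions, the target, the step size and the iteration count are hypothetical values chosen for the example.

```python
import math

def residuals_and_jacobian(s, u, receivers, path_lengths):
    """Ellipsoid function vector f(s) and its Jacobian J(s) with respect to x, y, z."""
    f, J = [], []
    du = math.dist(s, u)
    for v, length in zip(receivers, path_lengths):
        dv = math.dist(s, v)
        f.append(du + dv - length)        # zero when s lies on ellipsoid n
        J.append([(s[k] - u[k]) / du + (s[k] - v[k]) / dv for k in range(3)])
    return f, J

def intersect_ellipsoids(u, receivers, path_lengths, s0, gamma=0.1, iters=5000):
    """Gradient descent on 0.5 * ||f(s)||^2 with the update s <- s - gamma * J^T f."""
    s = list(s0)
    for _ in range(iters):
        f, J = residuals_and_jacobian(s, u, receivers, path_lengths)
        grad = [sum(J[n][k] * f[n] for n in range(len(f))) for k in range(3)]
        s = [s[k] - gamma * grad[k] for k in range(3)]
    return s

# Hypothetical geometry in metres: one transmitter, N = 3 receivers in the z = 0 plane.
u = (0.0, 0.0, 0.0)
receivers = [(0.3, 0.0, 0.0), (0.0, 0.3, 0.0), (-0.3, -0.3, 0.0)]
target = (0.1, 0.2, 0.5)
path_lengths = [math.dist(u, target) + math.dist(target, v) for v in receivers]

# Start on the sensor axis at half the mean path length, on the z > 0 side.
s0 = (0.0, 0.0, sum(path_lengths) / (2 * len(path_lengths)))
estimate = intersect_ellipsoids(u, receivers, path_lengths, s0)
error = math.dist(estimate, target)
```

Because all sensors in this example lie in one plane, a mirrored solution exists at negative z; the starting point on the positive side selects the physically meaningful one. A Gauss-Newton or Levenberg-Marquardt update could be substituted for the fixed-step update to reduce the number of iterations.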

Up until this stage, it is presumed that the radii r_n are drawn from detected echoes that originate from the same target. However, real-world scenes are made up of complex topologies with reflections arising from multiple objects, resulting in several echoes per channel. Hence, inter-channel echo matching is an important and non-trivial undertaking for the proposed echo localisation to work, which is addressed in what follows.

The association of echoes that belong to a distinct target is an essential prerequisite for the proposed localisation to work properly. For instance, matching echoes arriving from different targets yields false positive object positions, which makes the correspondence assignment a very useful step. Correct echo matching is a non-trivial undertaking that has received little attention in the past. The architectural design used for this purpose is visually outlined in Figure 4.

Figure 4 gives an overview of the echo correspondence architecture. In this example, the pre-processed detector (i.e. sensor) channel data y_n(t_i) from each of the detector channels 31 undergoes a self-supervised multi-modal exponentially-modified Gaussian (MEMG) oscillator optimisation carried out in feature extraction modules 33. These modules provide feature components that are fed into an echo association module 35, which in this example is an artificial intelligence network, more specifically an artificial neural network, and in particular a multi-layer Siamese neural network stack. The overall training loss for backpropagation consists of the binary cross-entropy (BCE) loss and the contrastive loss and is in this example obtained in a loss calculation module 37.

The present invention in this regard and in this example builds upon the MEMG model as a way to compress sonic echo information. As opposed to the originally proposed method, the oscillation terms are skipped here for reasons of phase invariance, such that the model is written with a concatenated component parameter vector containing each component estimate. While the summation over k ∈ {1, 2, ..., K} takes care of the multimodal distribution, each EMG is parameterised by its own component parameters, with erf(·) as the error function. An MEMG parameter set is estimated using an energy-based optimisation framework with the goal of minimising an energy functional for each frame n. This objective function is in this example solved using the Levenberg-Marquardt scheme, with Hessians being approximated by analytical Jacobians with respect to the parameter vector at each iteration j. The best approximated MEMG vector carries echo component estimates that serve as features for the subsequent echo association.

The above-described feature extraction process may be modified by taking into account the oscillation term. In this case, the MEMG model is extended by an oscillation term in which f and Φ denote frequency and phase, respectively. The oscillation function is obtained by using the cosine. This model thus introduces an oscillating term to the MEMG model to regard multiple echoes as univariate probability distributions in the time domain. In this scenario, and as illustrated in Figure 5, the feature extraction process is sub-divided into three stages of non-linear least-squares (NLLS) regression, each being in this example solved by the Levenberg-Marquardt (LM) algorithm to achieve robust convergence. The proposed framework gains characteristic features and reduces dimensionality, and thereby enables echo segmentation and lays the groundwork for classification tasks.

As is further shown in Figure 5, prior to the MEMG model estimation, an oscillating signal is routed through a band-pass filter that retains spectral intensities around the dominant frequency detected by a Fourier transform and suppresses the remaining ones. A second pre-processing stage, namely power compensation, counteracts the power loss over distance, which is caused by anisotropic radiation. Each incoming data frame y_n(t_i) can optionally be treated with an exponential fit, where a and b are the fitted variables. In a third pre-processing stage, a Hilbert transformation is carried out on the incoming frames.

Iterative optimisations are known to be error-prone for start values with a large numerical distance from the solution. For a robust regression, LM iterations are split into a three-stage process as is further shown in Figure 5, with (1) Hilbert-only EMG parameter regression of [α_k, μ_k, σ_k, η_k], which are used for (2) oscillation parameter optimisation of [f_k, Φ_k] and (3) a joint parameter minimisation of p_k at a final step. Initial positional estimates are obtained by using a threshold τ for the gradient of the Hilbert-transformed amplitude signal. Other parameters are initialised for all k, with the operating frequency given in kHz to avoid a large condition number in the Jacobian. Phase estimates are constrained.

Given the extracted echo feature dimensions, one may infer quantitative information about their estimation reliability. This can be done by using confidence measures. This step may be part of the NLLS fitting operation shown in Figure 5. For blind MEMG assessment, the confidence C_n per A-scan is defined by an inverted norm, so that large values indicate higher certainty. Similarly, this can be expressed as a per-component confidence c_{n,k}, which is computed from the fitted and raw signals in the range of the detected position.

It is important to note that MEMG parameters can be extended by other features to form a parameter set of length D, which is used in the subsequent echo correspondence selection. Provided the echo features, a multi-layer perceptron (MLP) (i.e. the Siamese network) is in this example trained for echo selection and correspondence decision-making. The scalar output b_{n,k} of each MLP is obtained by passing the features through MLP function layers at indices l ∈ {1, 2, 3, 4} (Equation 22). Each layer is equipped with trainable weights W_l and activated by a Rectified Linear Unit (ReLU), except for the last layer, which is followed by the sigmoid function. In this example, each layer has learnable weights of fixed dimensions with respective bias weights. The binary cross-entropy (BCE) loss is employed to train the output b_k, where Y_k ∈ {0, 1} represents ground-truth labels for each echo k and channel index n, the latter being omitted in loss functions for the sake of readability. As an initial echo classification, the BCE loss helps select an appropriate reference echo across all channels n and components k. Echo correspondence is then established through a dissimilarity score d_k (also referred to as an echo correspondence score), computed for each n as a component-wise Euclidean distance between learned feature output embeddings from the layer functions. This dissimilarity metric is used in the contrastive loss (Equation 25), where q > 0 is the margin regulating the border radius. Besides, this dissimilarity score acts as an indicator of how reliable a selected echo component is. For training, a loss aggregation is used in which the weights amount to λ_C = 1 and λ_B = 10 as an example to balance the numerical gap, whereas different values may be chosen to achieve desired training results. Figure 6 visually illustrates the echo correspondence determination.
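The loss terms named above can be illustrated with a minimal sketch. The exact expressions of the invention are not reproduced here; the sketch assumes the textbook forms of the BCE loss and of the margin-based contrastive loss, and a weighted sum for the aggregation. Function names and the label convention (1 for a matching pair) are hypothetical.

```python
import math

def bce_loss(b, y, eps=1e-12):
    """Binary cross-entropy between sigmoid output b in (0, 1) and label y in {0, 1}."""
    b = min(max(b, eps), 1.0 - eps)            # guard against log(0)
    return -(y * math.log(b) + (1 - y) * math.log(1.0 - b))

def dissimilarity(h_ref, h_other):
    """Euclidean distance between two learned feature embeddings."""
    return math.dist(h_ref, h_other)

def contrastive_loss(d, y, q=1.0):
    """Pull matching pairs (y = 1) together, push others apart up to the margin q."""
    return y * d ** 2 + (1 - y) * max(0.0, q - d) ** 2

def total_loss(l_contrastive, l_bce, lambda_c=1.0, lambda_b=10.0):
    """Weighted aggregation with the example weights lambda_C = 1 and lambda_B = 10."""
    return lambda_c * l_contrastive + lambda_b * l_bce

# Hypothetical embeddings of a reference echo and a candidate echo in another channel.
d = dissimilarity([0.2, 0.9, 0.1], [0.2, 0.6, 0.5])
loss = total_loss(contrastive_loss(d, y=1), bce_loss(0.8, y=1))
```

With this form, a non-matching pair whose embeddings are already farther apart than the margin q contributes nothing to the contrastive term.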

The above-described method steps may be carried out by suitable circuits or circuitry. The terms “circuits” and “circuitry” refer to physical electronic components or modules (e.g. hardware), and any software and/or firmware (“code”) that may configure the hardware, be executed by the hardware, and/or otherwise be associated with the hardware. The circuits may thus be operable to carry out the required method steps as described above, or they may comprise means for carrying them out.

At least some of the method steps can be considered as computer-implemented steps. The invention also relates to a non-transitory computer program product comprising instructions for implementing the steps or at least some of the steps of the method when loaded and run on computing means of a computing device, such as the data processing unit.

Some experimental results are discussed next. For data acquisition, a six-axis, vertically-articulated robot arm (Meca500) is employed to navigate a convex target to ground-truth (GT) positions. To suppress reflections from the robot, captures from an empty run in the absence of the target are subtracted from the frames. A dedicated training and validation set of 302 frames is captured by N = 4 sensors, where each acquisition contains at least 3 detected echoes per channel, yielding approximately 3600 EMG components for training overall. From this, a fraction of 0.3 is reserved as a validation set. Labels Y_k are inferred by projecting GT positions as GT ToA μ_gt in the time domain. Only a single MLP is trained because Siamese networks share weights. Using an Adam optimiser, a learning rate of 5e-4 has been shown to perform best. The frame batch size is 1, whereas losses of every k-th echo are back-propagated at each step. Weights are chosen to be λ_C = 1 and λ_B = 10 to balance the numerical loss gap. To prevent over-fitting, the maximum number of epochs is limited by early stopping criteria with tolerance = 5 and min. delta = 0.
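The early stopping criterion can be sketched as follows, assuming that "tolerance" denotes the number of consecutive epochs without improvement that is tolerated (often called patience) and that "min. delta" is the minimum decrease of the validation loss that counts as an improvement. The class name and the loss values are hypothetical.

```python
class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Register one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss          # improvement: remember it and reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1          # no improvement in this epoch
        return self.bad_epochs >= self.patience

# Hypothetical validation losses per epoch.
val_losses = [1.0, 0.8, 0.9, 0.85, 0.8, 0.81, 0.82, 0.7]
stopper = EarlyStopping(patience=5, min_delta=0.0)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

With min. delta = 0, a loss that merely equals the best value so far does not reset the counter.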

Numerical object localisation results from the acquired test data taken with the robot are provided in Table I of Figure 7, showing experimental 3D localisation results in mm. An important observation from this experiment is a tendency towards more significant errors with increasing radial distance from the xy-origin u = [0, 0, z]^T of the sensor device. To make this visible, Figure 8 depicts cross-sectional projections of the results. The radial error in (x, y) is expected, as minor deviations in relative ToAs (i.e., TDoAs) produce an enormous impact when jointly propagating to the xy-plane during an ellipsoid intersection. Therefore, the prototype sensor arrangement is radially symmetric, creating a point-symmetric error distribution around the centre u in the ideal case. It is also essential to consider that deviations are specific to the sensor hardware. This includes a limited temporal resolution (T = 64) as well as potential mechanical sensor misalignments. Also, minor object movements caused sonic interference in the experiment, letting target echoes fluctuate and almost disappear in certain frames. Besides that, echo detection and MEMG regression occasionally fuse closely overlapping echoes into a single detected component. To set the results from Table I in a broader context, a fair comparison of the proposed model with state-of-the-art methods would involve using the same sensor hardware and setup, which exceeds the scope of an initial feasibility experiment. Instead, error measures from closely related experiments are reported hereafter. Recently, C.-W. Juan and J.-S. Hu, “Object localization and tracking system using multiple ultrasonic sensors with Newton-Raphson optimization and Kalman filtering techniques,” Applied Sciences, vol. 11, no. 23, 2021, presented a 2D finger position tracking with an RMSE of 0.7 ± 0.5 cm using an extended Kalman filter for 6 transducers, each running at 40 kHz. The 3D object tracking device released by manufacturer Toposens GmbH, “ECHO ONE: 3D ultrasonic echolocation and ranging sensor,” Data Sheet V1.0, July 2022, achieves 1.0 ± 2.5 cm errors by correlating bounced-off phase signals from 3 receiving transducers running at 40 kHz, which are perpendicularly placed to each other in the range of the wavelength. Given that these deviations are from different sensor hardware, the results in Table I are within the expected error range.

Another important premise for the present invention to work is the proposed echo correspondence solver, which gives a promising F1-score of 1.0 on the test data from Table I, where all 18 echo correspondences are matched correctly. Figure 9 depicts an exemplary correspondence data sample.
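As a small worked example of the reported figure, the F1-score can be computed from match counts. The sketch assumes that the 18 correct correspondences count as true positives with no false positives or false negatives; the function name is hypothetical.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from true/false positive and negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# All 18 echo correspondences matched correctly, under the assumption stated above.
score = f1_score(tp=18, fp=0, fn=0)
```

Any mismatch lowers the score; for instance, 16 correct matches with 2 false positives and 2 misses would give roughly 0.89.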

Table II of Figure 10 provides an overview of each module’s impact on the overall matching performance of the proposed framework. The ablation study of the echo correspondence network is carried out by substituting the MLP from Equation 22 with an arg max operator over the amplitude scale α_k and the contrastive loss from Equation 25 with the Munkres (also known as Hungarian) algorithm. The impact of MEMG features from Equations 11 to 15 is evaluated by replacing them with the ToA from Equation 1 and its amplitude value. Table II demonstrates that the MLP and contrastive loss outperform the Munkres-based echo association in accuracy. Furthermore, MEMG features are more reliable correspondence indicators than ToAs alone. A suitable alternative to MEMG, e.g., based on learned convolutions, is yet to be devised since existing methods in the field (e.g., T. Padois, O. Doutres, and F. Sgard, “On the use of modified phase transform weighting functions for acoustic imaging with the generalized cross correlation,” The Journal of the Acoustical Society of America, vol. 145, no. 3, pp. 1546-1555, 2019) employ waveform data.
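The assignment baseline used in the ablation can be illustrated with a short sketch. For brevity it finds the minimum-cost one-to-one assignment by exhaustive search over permutations, which returns the same optimum as the Munkres algorithm on small problems but is not the Munkres algorithm itself. The cost definition (absolute ToA difference) and the ToA values are hypothetical.

```python
from itertools import permutations

def optimal_assignment(cost):
    """Minimum-cost one-to-one assignment of rows (reference echoes) to columns.

    Exhaustive search; requires at least as many columns as rows and is only
    practical for the handful of echoes found in a single frame.
    """
    n_rows, n_cols = len(cost), len(cost[0])
    best = min(permutations(range(n_cols), n_rows),
               key=lambda cols: sum(cost[r][c] for r, c in enumerate(cols)))
    return list(best)

# Hypothetical ToAs in milliseconds for a reference channel and one other channel.
reference_toas = [1.0, 2.0, 3.5]
candidate_toas = [2.1, 3.4, 0.9]
cost = [[abs(r - c) for c in candidate_toas] for r in reference_toas]
assignment = optimal_assignment(cost)      # assignment[i] = matched candidate index
```

Such a baseline matches echoes by ToA proximity only, which is the limitation that the learned dissimilarity score is intended to overcome.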

While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive, the invention being not limited to the disclosed embodiment. Other embodiments and variants are understood, and can be achieved by those skilled in the art when carrying out the claimed invention, based on a study of the drawings, the disclosure and the appended claims. New embodiments may be obtained by combining any of the teachings above.

In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. The mere fact that different features are recited in mutually different dependent claims does not indicate that a combination of these features cannot be advantageously used. Any reference signs in the claims should not be construed as limiting the scope of the invention.