Title:
SYSTEM AND APPARATUS FOR TRACKING MOVING AUDIO SOURCES
Document Type and Number:
WIPO Patent Application WO/2017/129239
Kind Code:
A1
Abstract:
A method comprises: receiving frequency domain representations of at least two audio signals, each recorded by a respective microphone of a microphone array comprising at least two microphones, the audio signals including at least one component resulting from one or more moving audio sources, wherein the frequency domain representations each comprise a plurality of time frames; determining, based on the frequency domain representations of the at least two audio signals, a spatial energy distribution over a set of directions around the recording module array for each time frame; converting the spatial energy distribution into multiple direction-of-arrival measurements; and based on at least a subset of the direction-of-arrival measurements, performing acoustical tracking to identify the one or more moving audio sources and their associated directions relative to the recording module array for each time frame.

Inventors:
VILERMO MIIKKA TAPANI (FI)
TAMMI MIKKO TAPIO (FI)
NIKUNEN JOONAS SAMULI (FI)
VIRTANEN TUOMAS (FI)
Application Number:
PCT/EP2016/051709
Publication Date:
August 03, 2017
Filing Date:
January 27, 2016
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
H04R3/00; G10L21/0272
Foreign References:
US20040252845A12004-12-16
US20150055797A12015-02-26
Other References:
VALIN ET AL: "Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering", ROBOTICS AND AUTONOMOUS SYSTEMS, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 55, no. 3, 15 February 2007 (2007-02-15), pages 216 - 228, XP005891315, ISSN: 0921-8890, DOI: 10.1016/J.ROBOT.2006.08.004
NIKUNEN JOONAS ET AL: "Direction of Arrival Based Spatial Covariance Model for Blind Sound Source Separation", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE, USA, vol. 22, no. 3, 1 March 2014 (2014-03-01), pages 727 - 739, XP011539739, ISSN: 2329-9290, [retrieved on 20140210], DOI: 10.1109/TASLP.2014.2303576
R. H. BYRD; R. B. SCHNABEL; G. A. SHULTZ: "A trust region algorithm for nonlinearly constrained optimization", SIAM JOURNAL ON NUMERICAL ANALYSIS, vol. 24, no. 5, 1987, pages 1152 - 1170
H. NIES; O. LOFFELD; R. WANG: "Phase unwrapping using 2D Kalman filter - potential and limitations", GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2008), IEEE INTERNATIONAL, vol. 4, 2008, IEEE, pages IV-1213
S. SÄRKKÄ; A. VEHTARI; J. LAMPINEN: "Rao-Blackwellized particle filter for multiple target tracking", INFORMATION FUSION, vol. 8, no. 1, 2007, pages 2 - 15
"Rao-Blackwellized Monte Carlo data association for multiple target tracking", PROCEEDINGS OF THE SEVENTH INTERNATIONAL CONFERENCE ON INFORMATION FUSION, vol. 1, no. I, 2004, pages 583 - 590
J. NIKUNEN; T. VIRTANEN: "Direction of arrival based spatial covariance model for blind sound source separation", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, vol. 22, no. 3, 2014, pages 727 - 739
A. P. DEMPSTER; N. M. LAIRD; D. B. RUBIN: "Maximum likelihood from incomplete data via the EM algorithm", JOURNAL OF THE ROYAL STATISTICAL SOCIETY, vol. 39, no. 1, 1977, pages 1 - 38
"Multichannel extensions of non-negative matrix factorization with complex-valued data", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, vol. 21, no. 5, 2013, pages 971 - 982
Attorney, Agent or Firm:
ANDERSON, Oliver (GB)
Claims:
Claims

1. A method comprising:

receiving frequency domain representations of at least two audio signals, each recorded by a respective microphone of a microphone array comprising at least two microphones, the audio signals including at least one component resulting from one or more moving audio sources, wherein the frequency domain representations each comprise a plurality of time frames;

determining, based on the frequency domain representations of the at least two audio signals, a spatial energy distribution over a set of directions around the recording module array for each time frame;

converting the spatial energy distribution into multiple direction-of-arrival measurements; and

based on at least a subset of the direction-of-arrival measurements, performing acoustical tracking to identify the one or more moving audio sources and their associated directions relative to the recording module array for each time frame.

2. A method according to claim 1, further comprising:

determining for each time frame a spatial covariance matrix model for each identified source based on the identified one or more moving audio sources and their associated directions relative to the recording module array for each time frame.

3. A method according to claim 2, further comprising:

estimating for each time frame a spectrogram model for each of the one or more moving audio sources using the estimated spatial covariance matrix models and the frequency domain representations of the at least two audio signals.

4. A method according to claim 3, further comprising:

separating the frequency domain representations of the at least two audio signals based on the estimated spectrogram models using time-frequency filtering to produce separated frequency domain representations of each of the moving audio sources.

5. The method according to any of claims 2 to 4, comprising:

restoring spatial energies for each identified audio source based on the associated directions relative to the recording module array; and

determining the spatial covariance matrix models for each identified source based on the restored spatial energies.

6. A method according to any preceding claim, comprising:

performing the acoustical tracking using a particle filtering algorithm.

7. A method according to claim 6, wherein the particle filtering algorithm comprises a Rao-Blackwellized particle filter.

8. A method according to any preceding claim comprising:

determining the spatial energy distribution using a steered response power algorithm with a phase transform weighting.

9. The method according to any preceding claim comprising:

converting the spatial energy distribution into multiple direction-of-arrival measurements by estimating a wrapped Gaussian mixture model of the observed spatial energy in each time frame.

10. The method according to claim 9, comprising decluttering the direction-of-arrival measurements by filtering out measurements having particular parameters which do not satisfy a particular criterion.

11. The method according to claim 10, wherein the direction-of-arrival measurements comprise plural mean angle-of-arrival values and a variance associated with each mean angle-of-arrival value, the method further comprising:

filtering out the direction-of-arrival measurements for which the associated variance is above a threshold.

12. The method according to either of claims 10 and 11, wherein each of the direction-of-arrival measurements comprise plural mean angle-of-arrival values and a weight associated with each mean angle-of-arrival value, the method further comprising:

filtering out the direction-of-arrival measurements for which the associated weight is below a threshold.

13. The method according to any preceding claim wherein estimating the spectrogram models for each of the one or more moving audio sources comprises performing iterative optimisation of parameters of the spectrogram models.

14. The method according to claim 13, comprising, prior to performing iterative optimisation of the parameters of the spectrogram models, initializing the parameters using spatial energy distribution values for each identified source which are scaled such that the spatial energy distribution values sum to unity when the source is active and which are set to zero when the source is inactive.

15. The method according to any preceding claim, comprising:

transforming the at least two audio signals into their frequency domain representations using a short-time Fourier transform.

16. The method of any preceding claim comprising:

synthesising audio signals for each identified moving audio source based on the separated frequency domain representations of each of the moving audio sources.

17. A method comprising:

receiving frequency domain representations of at least two audio signals, each recorded by a respective microphone of a microphone array comprising at least two microphones, the audio signals including at least one component resulting from one or more moving audio sources, wherein the frequency domain representations each comprise a plurality of time frames;

determining, based on the frequency domain representations of the at least two audio signals, a spatial energy distribution over a set of directions around the recording module array for each time frame;

converting the spatial energy distribution into multiple direction-of-arrival measurements;

based on at least a subset of the direction-of-arrival measurements, performing acoustical tracking to identify the one or more moving audio sources and their associated directions relative to the recording module array for each time frame;

determining for each time frame a spatial covariance matrix model for each identified source based on the identified one or more moving audio sources and their associated directions relative to the recording module array for each time frame;

estimating for each time frame a spectrogram model for each of the one or more moving audio sources using the estimated spatial covariance matrix models and the frequency domain representations of the at least two audio signals; and

separating the frequency domain representations of the at least two audio signals based on the estimated spectrogram models using time-frequency filtering to produce separated frequency domain representations of each of the moving audio sources.

18. Apparatus configured to perform a method according to any preceding claim.

19. Computer-readable instructions which when executed by computing apparatus cause the computing apparatus to perform a method as claimed in any of claims 1 to 17.

20. Apparatus comprising:

at least one processor; and

at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus:

to receive frequency domain representations of at least two audio signals, each recorded by a respective microphone of a microphone array comprising at least two microphones, the audio signals including at least one component resulting from one or more moving audio sources, wherein the frequency domain representations each comprise a plurality of time frames;

to determine, based on the frequency domain representations of the at least two audio signals, a spatial energy distribution over a set of directions around the recording module array for each time frame;

to convert the spatial energy distribution into multiple direction-of- arrival measurements;

based on at least a subset of the direction-of-arrival measurements, to perform acoustical tracking to identify the one or more moving audio sources and their associated directions relative to the recording module array for each time frame.

21. The apparatus according to claim 20, wherein the computer program code, when executed by the at least one processor, causes the apparatus to determine for each time frame a spatial covariance matrix model for each identified source based on the identified one or more moving audio sources and their associated directions relative to the recording module array for each time frame.

22. The apparatus according to claim 21, wherein the computer program code, when executed by the at least one processor, causes the apparatus to estimate for each time frame a spectrogram model for each of the one or more moving audio sources using the estimated spatial covariance matrix models and the frequency domain representations of the at least two audio signals.

23. The apparatus according to claim 22, wherein the computer program code, when executed by the at least one processor, causes the apparatus to separate the frequency domain representations of the at least two audio signals based on the estimated spectrogram models using time-frequency filtering to produce separated frequency domain representations of each of the moving audio sources.

24. The apparatus according to any of claims 21 to 23, wherein the computer program code, when executed by the at least one processor, causes the apparatus to: restore spatial energies for each identified audio source based on the associated directions relative to the recording module array; and

determine the spatial covariance matrix models for each identified source based on the restored spatial energies.

25. The apparatus according to any of claims 20 to 24, wherein the computer program code, when executed by the at least one processor, causes the apparatus to: perform the acoustical tracking using a particle filtering algorithm.

26. The apparatus according to claim 25, wherein the particle filtering algorithm comprises a Rao-Blackwellized particle filter.

27. The apparatus according to any of claims 20 to 26, wherein the computer program code, when executed by the at least one processor, causes the apparatus to determine the spatial energy distribution using a steered response power algorithm with a phase transform weighting.

28. The apparatus according to any of claims 20 to 27, wherein the computer program code, when executed by the at least one processor, causes the apparatus to convert the spatial energy distribution into multiple direction-of-arrival measurements by estimating a wrapped Gaussian mixture model of the observed spatial energy in each time frame.

29. The apparatus according to claim 28, wherein the computer program code, when executed by the at least one processor, causes the apparatus to declutter the direction-of-arrival measurements by filtering out measurements having particular parameters which do not satisfy a particular criterion.

30. The apparatus according to claim 29, wherein each of the direction-of-arrival measurements comprise plural mean angle-of-arrival values and a variance associated with each mean angle-of-arrival value, wherein the computer program code, when executed by the at least one processor, causes the apparatus to filter out the direction-of-arrival measurements for which the associated variance is above a threshold.

31. The apparatus according to either of claims 29 and 30, wherein each of the direction-of-arrival measurements comprise plural mean angle-of-arrival values and a weight associated with each mean angle-of-arrival value, wherein the computer program code, when executed by the at least one processor, causes the apparatus to filter out the direction-of-arrival measurements for which the associated weight is below a threshold.

32. The apparatus according to any of claims 20 to 31, wherein the computer program code, when executed by the at least one processor, causes the apparatus to estimate the spectrogram model for each of the one or more moving audio sources by performing iterative optimisation of parameters of the spectrogram models.

33. The apparatus according to claim 32, wherein the computer program code, when executed by the at least one processor, causes the apparatus, prior to performing iterative optimisation of the parameters of the spectrogram models, to initialize the parameters using spatial energy distribution values for each identified source which are scaled such that the spatial energy distribution values sum to unity when the source is active and which are set to zero when the source is inactive.

34. The apparatus according to any of claims 20 to 33, wherein the computer program code, when executed by the at least one processor, causes the apparatus to transform the at least two audio signals into their frequency domain representations using a short-time Fourier transform.

35. The apparatus according to any of claims 20 to 34, wherein the computer program code, when executed by the at least one processor, causes the apparatus to synthesise audio signals for each identified moving audio source based on the separated frequency domain representations of each of the moving audio sources.

36. A computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least:

receiving frequency domain representations of at least two audio signals, each recorded by a respective microphone of a microphone array comprising at least two microphones, the audio signals including at least one component resulting from one or more moving audio sources, wherein the frequency domain representations each comprise a plurality of time frames;

determining, based on the frequency domain representations of the at least two audio signals, a spatial energy distribution over a set of directions around the recording module array for each time frame;

converting the spatial energy distribution into multiple direction-of-arrival measurements; and

based on at least a subset of the direction-of-arrival measurements, performing acoustical tracking to identify the one or more moving audio sources and their associated directions relative to the recording module array for each time frame.

37. Apparatus comprising:

means for receiving frequency domain representations of at least two audio signals, each recorded by a respective microphone of a microphone array comprising at least two microphones, the audio signals including at least one component resulting from one or more moving audio sources, wherein the frequency domain representations each comprise a plurality of time frames;

means for determining, based on the frequency domain representations of the at least two audio signals, a spatial energy distribution over a set of directions around the recording module array for each time frame;

means for converting the spatial energy distribution into multiple direction-of-arrival measurements; and

means for performing acoustical tracking to identify the one or more moving audio sources and their associated directions relative to the recording module array for each time frame, based on at least a subset of the direction-of-arrival measurements.

Description:
SYSTEM AND APPARATUS FOR TRACKING MOVING AUDIO SOURCES

Field

This specification relates to processing audio signals and, more specifically, to processing audio signals for separating one or more moving audio sources.

Background

Spatial audio signals are being used more often to produce a more immersive audio experience. A stereo or multi-channel recording can be passed from the recording or capture apparatus to a listening apparatus and replayed using a suitable multi-channel output such as a multi-channel loudspeaker arrangement and, with virtual surround processing, a pair of stereo headphones or headset.

Nowadays some mobile apparatuses, such as mobile phones, have more than two microphones. This offers the possibility to record real multichannel audio. With advanced signal processing it is further possible to beamform or directionally-process the audio signal from the microphones from a specific or desired direction by determining parameters such as direction associated with audio sources and processing the audio sources, based on their directions.

Summary

In a first aspect, this specification describes a method comprising: receiving frequency domain representations of at least two audio signals, each recorded by a respective microphone of a microphone array comprising at least two microphones, the audio signals including at least one component resulting from one or more moving audio sources, wherein the frequency domain representations each comprise a plurality of time frames; determining, based on the frequency domain representations of the at least two audio signals, a spatial energy distribution over a set of directions around the recording module array for each time frame; converting the spatial energy distribution into multiple direction-of-arrival measurements; and based on at least a subset of the direction-of-arrival measurements, performing acoustical tracking to identify the one or more moving audio sources and their associated directions relative to the recording module array for each time frame. The method may further comprise determining for each time frame a spatial covariance matrix model for each identified source based on the identified one or more moving audio sources and their associated directions relative to the recording module array for each time frame. The method may further comprise estimating a spectrogram model for one or more time frames for each of the one or more moving audio sources using the estimated spatial covariance matrix models and the frequency domain representations of the at least two audio signals. The method may further comprise separating the frequency domain representations of the at least two audio signals based on the estimated spectrogram model using time-frequency filtering to produce separated frequency domain representations of each of the moving audio sources. The method may further comprise restoring spatial energies for each identified audio source based on the associated directions relative to the recording module array; and determining the spatial covariance matrix models for each identified source based on the restored spatial energies.

The method may comprise performing the acoustical tracking using a particle filtering algorithm. The particle filtering algorithm may comprise a Rao-Blackwellized particle filter.

The method may comprise determining the spatial energy distribution using a steered response power algorithm with a phase transform weighting.

The method may comprise converting the spatial energy distribution into multiple direction-of-arrival measurements by estimating a wrapped Gaussian mixture model of the observed spatial energy in each time frame. The method may comprise decluttering the direction-of-arrival measurements by filtering out measurements having particular parameters which do not satisfy a particular criterion. The direction-of-arrival measurements may comprise plural mean angle-of-arrival values and a variance associated with each mean value, and the method may further comprise filtering out the direction-of-arrival measurements for which the associated variance is above a threshold. The direction-of-arrival measurements may comprise plural mean angle-of-arrival values and an associated weight, and the method may further comprise filtering out the direction-of-arrival measurements for which the associated weight is below a threshold.

Estimating the spectrogram model for each of the one or more moving audio sources may comprise performing iterative optimisation of parameters of the spectrogram model. The method may further comprise, prior to performing iterative optimisation of the parameters of the spectrogram model, initializing the parameters using spatial energy distribution values for each identified source which are scaled such that the spatial energy distribution values sum to unity when the source is active and which are set to zero when the source is inactive.

The method may comprise transforming the at least two audio signals into their frequency domain representations using a short-time Fourier transform.

The method may comprise synthesising audio signals for each identified moving audio source based on the separated frequency domain representations of each of the moving audio sources.

The method may comprise in some examples: receiving frequency domain representations of at least two audio signals, each recorded by a respective microphone of a microphone array comprising at least two microphones, the audio signals including at least one component resulting from one or more moving audio sources, wherein the frequency domain representations each comprise a plurality of time frames; determining, based on the frequency domain representations of the at least two audio signals, a spatial energy distribution over a set of directions around the recording module array for each time frame; converting the spatial energy distribution into multiple direction-of-arrival measurements; based on at least a subset of the direction-of-arrival measurements, performing acoustical tracking to identify the one or more moving audio sources and their associated directions relative to the recording module array for each time frame; determining for each time frame a spatial covariance matrix model for each identified source based on the identified one or more moving audio sources and their associated directions relative to the recording module array for each time frame; estimating for each time frame a spectrogram model for each of the one or more moving audio sources using the estimated spatial covariance matrix models and the frequency domain representations of the at least two audio signals; and separating the frequency domain representations of the at least two audio signals based on the estimated spectrogram model using time-frequency filtering to produce separated frequency domain representations of each of the moving audio sources.

In a second aspect, this specification describes apparatus configured to perform a method as described with reference to the first aspect. In a third aspect, this specification describes computer-readable instructions which when executed by computing apparatus cause the computing apparatus to perform a method as described with reference to the first aspect. In a fourth aspect, this specification describes apparatus comprising at least one processor and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus: to receive frequency domain representations of at least two audio signals, each recorded by a respective microphone of a microphone array comprising at least two microphones, the audio signals including at least one component resulting from one or more moving audio sources, wherein the frequency domain representations each comprise a plurality of time frames; to determine, based on the frequency domain representations of the at least two audio signals, a spatial energy distribution over a set of directions around the recording module array for each time frame; to convert the spatial energy distribution into multiple direction-of-arrival measurements; and based on at least a subset of the direction-of-arrival measurements, to perform acoustical tracking to identify the one or more moving audio sources and their associated directions relative to the recording module array for each time frame. The computer program code, when executed by the at least one processor, may cause the apparatus to determine for each time frame a spatial covariance matrix model for each identified source based on the identified one or more moving audio sources and their associated directions relative to the recording module array for each time frame. The computer program code, when executed by the at least one processor, may further cause the apparatus to estimate for each time frame a spectrogram model for each of the one or more moving audio sources using the estimated spatial covariance matrix models and the frequency domain representations of the at least two audio signals. The computer program code, when executed by the at least one processor, may further cause the apparatus to separate the frequency domain representations of the at least two audio signals based on the estimated spectrogram models using time-frequency filtering to produce separated frequency domain representations of each of the moving audio sources. The computer program code, when executed by the at least one processor, may cause the apparatus to restore spatial energies for each identified audio source based on the associated directions relative to the recording module array, and to determine the spatial covariance matrix models for each identified source based on the restored spatial energies. The computer program code, when executed by the at least one processor, may cause the apparatus to perform the acoustical tracking using a particle filtering algorithm. The particle filtering algorithm may comprise a Rao-Blackwellized particle filter.

The computer program code, when executed by the at least one processor, may cause the apparatus to determine the spatial energy distribution using a steered response power algorithm with a phase transform weighting. The computer program code, when executed by the at least one processor, may cause the apparatus to convert the spatial energy distribution into multiple direction-of-arrival measurements by estimating a wrapped Gaussian mixture model of the observed spatial energy in each time frame. The computer program code, when executed by the at least one processor, may further cause the apparatus to declutter the direction-of-arrival measurements by filtering out measurements having particular parameters which do not satisfy a particular criterion. The direction-of-arrival measurements may comprise plural mean angle-of-arrival values and a variance associated with each mean angle-of-arrival value, and the computer program code, when executed by the at least one processor, may cause the apparatus to filter out the direction-of-arrival measurements for which the associated variance is above a threshold. The direction-of-arrival measurements may comprise plural mean angle-of-arrival values and a weight associated with each mean angle-of-arrival value, wherein the computer program code, when executed by the at least one processor, causes the apparatus to filter out the direction-of-arrival measurements for which the associated weight is below a threshold.

The computer program code, when executed by the at least one processor, may cause the apparatus to estimate the spectrogram model for each of the one or more moving audio sources by performing iterative optimisation of parameters of the spectrogram model.

The computer program code, when executed by the at least one processor, may cause the apparatus, prior to performing iterative optimisation of the parameters of the spectrogram model, to initialize the parameters using spatial energy distribution values for each identified source which are scaled such that the spatial energy distribution values sum to unity when the source is active and which are set to zero when the source is inactive.

The computer program code, when executed by the at least one processor, may cause the apparatus to transform the at least two audio signals into their frequency domain representations using a short-time Fourier transform.

The computer program code, when executed by the at least one processor, may cause the apparatus to synthesise audio signals for each identified moving audio source based on the separated frequency domain representations of each of the moving audio sources.

In a fifth aspect, this specification describes a computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least: receiving frequency domain representations of at least two audio signals, each recorded by a respective microphone of a microphone array comprising at least two microphones, the audio signals including at least one component resulting from one or more moving audio sources, wherein the frequency domain representations each comprise a plurality of time frames; determining, based on the frequency domain representations of the at least two audio signals, a spatial energy distribution over a set of directions around the recording module array for each time frame; converting the spatial energy distribution into multiple direction-of-arrival measurements; and based on at least a subset of the direction-of-arrival measurements, performing acoustical tracking to identify the one or more moving audio sources and their associated directions relative to the recording module array for each time frame. The computer-readable code stored on the medium of the fifth aspect may further cause performance of any of the operations described with reference to the method of the first aspect. In a sixth aspect, this specification describes apparatus comprising: means for receiving frequency domain representations of at least two audio signals, each recorded by a respective microphone of a microphone array comprising at least two microphones, the audio signals including at least one component resulting from one or more moving audio sources, wherein the frequency domain representations each comprise a plurality of time frames; means for determining, based on the frequency domain representations of the at least two audio signals, a spatial energy distribution over a set of directions around the recording module array for each time frame; means for converting the spatial energy distribution into multiple direction-of-arrival measurements; and means for performing acoustical tracking to identify the one or more moving audio sources and their associated directions relative to the recording module array for each time frame, based on at least a subset of the direction-of-arrival measurements. The apparatus of the sixth aspect may further comprise means for causing performance of any of the operations described with reference to the method of the first aspect.

Brief Description of the Figures

For better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:

Figure 1 shows schematically an electronic apparatus including audio signal processing apparatus;

Figure 2 shows in a functional manner an example of an audio signal processing apparatus which may form part of the apparatus of Figure 1;

Figure 3 is a flow diagram illustrating various operations which may be performed by the audio signal processing apparatus of Figure 2;

Figure 4A illustrates graphically direction-of-arrival measurements for a single time frame derived from a sample audio signal including components derived from two audio sources;

Figure 4B illustrates graphically direction-of-arrival measurements for all time frames derived from the same sample audio signal;

Figure 4C illustrates graphically the direction-of-arrival measurements for all time frames following de-cluttering;

Figure 4D illustrates graphically the paths of the two moving audio sources which have been identified based on the decluttered direction-of-arrival measurements of Figure 4C;

Figure 5 is a flow diagram illustrating various operations which may be performed by a spectrogram parameter estimator when estimating the spectrogram models for the detected sources; and

Figure 6 shows schematically example microphone configurations in an apparatus for capturing audio signals derived from one or more moving audio sources.

Detailed Description of Embodiments

In the description and drawings, like reference numerals refer to like elements throughout. The following describes in further detail suitable apparatus and possible mechanisms for the provision of effective sound-field directional processing of audio recordings, for example within audio-video capture apparatus. In the following examples, audio signals and the processing of audio signals are described. However, it should be appreciated that, in some embodiments, the audio signal/audio capture and processing may be a part of an audio-video system.

Mobile devices or apparatus are nowadays more commonly being equipped with multiple microphone configurations or microphone arrays suitable for recording or capturing the audio environment or audio scene surrounding the mobile device or apparatus. A multiple microphone configuration enables the recording of stereo or surround-sound signals and the known location and orientation of the microphones further enables the apparatus to process the captured or recorded audio signals from the microphones to perform spatial processing to emphasise or focus on the audio signals from a defined direction relative to other directions.

One way to perform spatial processing is to initially extract and manipulate the direction or sound source dependent information and to use this information in subsequent applications. These applications can include, for example, spatial audio coding (SAC), 3D sound-field analysis and synthesis, sound source separation and speaker extraction for further processing such as speech recognition. In general, the field that studies such spatial sound processing is known as blind source separation (BSS) for simultaneously emitting sound sources. A classic example of such a case is the cocktail party problem: separating each individual speaker from a recording of the party made using a microphone array. The field of BSS has been intensively studied, but is still categorized as an unsolved problem. In real-life use cases, the capturing or recording apparatus or device usually consists of a small handheld device having multiple microphones. The multiple channels and their information correlation and relationship can then be utilized for source separation and direction of arrival estimation.

Furthermore, applications employing such analysis, such as 3D sound-field analysis and synthesis, can employ the accurate and detailed directional information of the separated sources when rendering the captured field by positioning the source using either binaural synthesis by head related transfer function (HRTF) filtering or source positioning in multichannel and multidimensional loudspeaker arrays using source positioning techniques such as vector base amplitude panning (VBAP).

Blind sound source separation (BSS) of audio captures recorded using a small and enclosed microphone array, such as is conventionally found on a mobile device or apparatus, can involve the following problems and difficulties, which are addressed by the embodiments described herein.

Firstly, the number of microphones is typically small, approximately 2-5 capsules, because of design volume and cost constraints, making source direction of arrival (DoA) estimation difficult and pure beamforming-based separation inefficient.

Beamforming for source direction of arrival detection, and more recently spherical array beamforming techniques, have been successfully used in sound field capture and analysis and have also been developed into final products. However, the problem with spherical array processing is that the array structure and the sheer size of the actual arrays used prevent it from being incorporated into a single mobile device. Furthermore, pure beamforming does not address the problem of source separation but analyses the spatial space around the device with beams as narrow as possible. The side-lobe cancellation required for decreasing the beam width generally requires increasing the microphone count of the array, which is costly in volume, device complexity and cost of manufacture.

Furthermore, the small geometrical distance between capsules reduces the time delays between microphones, which requires capturing using a high sampling rate in order to observe the small time differences. When a high sampling frequency is used, frequency domain based BSS methods suffer from spatial aliasing. In other words, audio frequencies with a wavelength less than two times the microphone separation can cause ambiguity in resolving the time delays in the form of a phase delay after a short time Fourier transform (STFT).
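As a worked example of this limit (assuming, purely for illustration, a microphone spacing of 2 cm and a speed of sound of approximately 343 m/s; neither value is specified here):

```latex
% Spatial aliasing limit for a microphone pair with spacing d:
% ambiguity arises for wavelengths shorter than 2d, i.e. above
%   f_alias = c / (2d).
% With the assumed d = 0.02 m and c = 343 m/s:
\[
  f_{\text{alias}} = \frac{c}{2d}
                   = \frac{343\ \text{m/s}}{2 \times 0.02\ \text{m}}
                   \approx 8.6\ \text{kHz}
\]
% so, for this assumed spacing, content above roughly 8.6 kHz can yield
% ambiguous inter-microphone phase differences after the STFT.
```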

For example, independent component analysis (ICA) can be applied in the frequency domain to estimate statistically independent components at each frequency. Frequency-domain ICA leads to an arbitrary source ordering at each frequency. This permutation ambiguity has been solved by different means over the years, including mixing filter frequency response smoothness, the temporal structure of the source signals, time-difference of arrival (TDoA) and direction of arrival (DoA), and interpretation of ICA mixing parameters. Furthermore, there also exist ICA-based methods that avoid the permutation problem by unifying the source independencies across frequencies. However, ICA-based separation is sensitive to problems caused by spatial aliasing in permutation alignment and in unifying the source independencies over frequency.

Furthermore, non-negative matrix factorization (NMF) based separation in multichannel cases has been proposed. This includes, for example, multichannel NMF for convoluted mixtures; however, the EM algorithm used for parameter estimation is inefficient without oracle initialization (in other words, knowing the source characteristics in advance in order to initialize the algorithm). Complex multichannel NMF (CNMF) with multiplicative updates has been proposed with promising separation results. The proposed CNMF algorithms estimate the source spatial covariance properties and the magnitude model. However, the spatial covariance matrices are estimated and updated individually for each frequency bin, making the algorithm prone to estimation errors at high frequencies with spatial aliasing. Also, the estimated covariance properties are not connected to the spatial locations of the sources.

In addition, direct source magnitude envelope and spatial covariance matrix estimation has been proposed. The spatial properties are estimated frequency bin-wise, leading again to permutation ambiguity and requiring a separate algorithm for solving the component ordering, which makes the approach inefficient with high sampling rate captures.

Additionally, the problem includes solving and executing 3D sound synthesis of the separated sources. It should further be understood that, where spatial processing is performed with respect to spatial audio synthesis, such as 3D audio synthesis, the 3D synthesis of the separated sources or parts of the sources requires pairing the separation algorithm with DoA analysis, making the system potentially discontinuous and less efficient for the 3D sound scene analysis-synthesis loop. As such, an enclosed microphone array with an unknown directivity pattern for each capsule requires a machine learning based algorithm for learning and compensating the unknown properties of the array.

Moving sound sources add another layer of complexity to the sound source separation methods discussed above. Thus, the concept described herein in further detail is one in which the audio recording system provides apparatus and/or methods for separating moving audio sources using plural microphones in one device. More specifically, this specification describes a blind sound source separation method for a dynamic scenario captured using a spaced microphone array, which may in some examples be a compact array (for instance, in a mobile device). In other examples, the spaced microphone array may be made up of more than one physically separate device, each including at least one microphone. As will be understood from the description below, the methods described herein may be based on online tracking by particle filtering and estimating an NMF-based spectral model of the tracked sources to separate them by time-frequency filtering.

In some examples described below, the observed spatial energy as a function of time and direction of arrival is calculated for the signal under analysis. The observed spatial energy may be in the form of a steered response power (SRP). Subsequently, a wrapped Gaussian mixture model (WGMM) of the observed spatial energy distributions in each time frame may be estimated. The WGMM means and variances may then be used as direction of arrival measurements for acoustic tracking, for instance using a Rao-Blackwellized particle filter. The acoustic tracking detects or identifies the underlying sources, associates the means and variances with the detected sound sources and outputs the source state in each time frame. Based on the acoustic tracking output, a DOA-based spatial covariance matrix (SCM) model may be defined for each tracked source for each time frame. The SCM model denotes the spatial behaviour of the sources, and a spectrogram model of the sources, consisting of evidence originating from the tracked direction, is estimated. The individual source signals may then be reconstructed using a separation mask formulated as a Wiener filter based on the estimated spectrogram model of each source.

In this regard reference is first made to Figure 1 which shows a schematic block diagram of an exemplary electronic apparatus 10, which may be used to record (or capture) audio signals derived from one or more audio sources.

The electronic apparatus 1 may, for example, be a mobile terminal or user equipment of a wireless communication system. In some examples, the apparatus 1 may be an audio player or audio recorder, such as an MP3 player, a media recorder/player (also known as an MP4 player), or any suitable portable apparatus suitable for recording audio or audio/video. The electronic apparatus 1 may in some embodiments comprise an audio subsystem 10. The audio subsystem 10 may for example comprise an array of microphones 11 for audio signal capture. The array of microphones may be solid state microphones, in other words capable of capturing audio signals and outputting a suitable digital format signal, and thus not requiring an analogue-to-digital converter. The microphones 11 may alternatively comprise any suitable microphone or audio capture means, for example a condenser microphone, capacitor microphone, electrostatic microphone, electret condenser microphone, dynamic microphone, ribbon microphone, carbon microphone, piezoelectric microphone, or micro electrical-mechanical system (MEMS) microphone. The microphones 11 of the array may in such embodiments output the captured audio signal to an analogue-to-digital converter (ADC) 14.

The audio subsystem 10 may further comprise an analogue-to-digital converter (ADC) 14 configured to receive the analogue captured audio signal from the microphones and to output the audio captured signal in a suitable digital form. The ADC converter 14 may be any suitable analogue-to-digital conversion or processing means. In some examples, in which the microphones are "integrated" microphones, the microphones may contain both audio signal generating and analogue-to-digital conversion capability.

The audio subsystem 10 may, in some examples, further comprise a digital-to-analogue converter 32 for converting digital audio signals received from a processing apparatus 21 to a suitable analogue format. The digital-to-analogue converter (DAC) or signal processing means 32 may utilise any suitable DAC technology.

Furthermore, the audio subsystem may comprise, in some embodiments, a speaker 33. The speaker 33 may receive the output from the DAC 32 and present the analogue audio signal to the user. The speaker 33 may be representative of a multi-speaker arrangement, a headset, for example a set of headphones, or cordless headphones.

Although the electronic apparatus 1 is shown having only audio capture and audio presentation components, it should be understood that, in some embodiments, the electronic apparatus 10 may additionally comprise video capture and video presentation components such as a camera (for video capture) and/or a display (for video presentation). In some embodiments, the audio subsystem 10 comprises a control apparatus 20 for controlling the other components of the audio subsystem 10. The control apparatus 20 may be coupled to the ADC 14 for receiving digital signals representing audio signals from the microphone 11 and/or to the DAC to provide digital signals for presentation to the user via the speaker.

The control apparatus may comprise processing apparatus 21 coupled with memory 22. The processing apparatus 21 may be configured to execute various program codes. The implemented program codes may comprise for example audio recording and audio presentation routines. The program codes may be configured to perform audio signal processing. As such, the control apparatus 20 may, in some examples, be referred to as audio signal processing apparatus 20.

The memory 22 may be any suitable storage means. In some embodiments the memory 22 may comprise a program code section 23 for storing program codes implementable using the processing apparatus 21. Furthermore, in some embodiments the memory 22 may further comprise a stored data section 24 for storing data, for example data that has been encoded in accordance with the application or data to be encoded via the application embodiments as described later. The implemented program code stored within the program code section 23, and the data stored within the stored data section 24 can be retrieved by the processing apparatus 21 whenever needed via the memory-processor coupling.

In some further embodiments the electronic apparatus 10 may comprise a user interface 15. The user interface 15 may be coupled in some embodiments to the processing apparatus 21. The processing apparatus 21 may control the operation of the user interface 15 and receive inputs from the user interface 15. The user interface 15 may enable a user to input commands to the electronic apparatus 10, for example via a keypad, and/or to obtain information from the apparatus 10, for example via a display which is part of the user interface 15. The user interface 15 may in some embodiments as described herein comprise a touch screen or touch interface capable of both enabling information to be entered to the electronic apparatus 10 and further displaying information to the user of the device 10. In some embodiments the apparatus further comprises a transceiver 13, which in such embodiments may be coupled to the processing apparatus 21 and configured to enable communication with other electronic apparatuses, for example via a wireless communications network. The transceiver 13 or any suitable transceiver or transmitter and/or receiver means may be configured to communicate with other electronic apparatuses via a wire or wired coupling.

The transceiver 13 may be configured to communicate with further apparatus by any suitable known communications protocol. For example, in some embodiments the transceiver 13 or transceiver means may use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or an infrared data communication pathway (IRDA).

It is to be understood that the structure of the electronic apparatus 10 may be supplemented and varied in many ways.

Figure 6 illustrates an example of an electronic apparatus 1, such as that described with reference to Figure 1, which comprises a front face 301 comprising a camera 51, a rear face 303 and a top edge or face 305. In the example shown in Figure 6, the audio subsystem 10 of the apparatus 1 comprises four microphones. Specifically, in the example shown, the apparatus 1 comprises a first (front right) microphone 11₁ located at the front right side of the apparatus 1 (where right is towards the top edge of the front face 301 of the apparatus 1), a front left microphone 11₃ located at the front left side of the apparatus 1, a right high microphone 11₂ located at the top edge or face side of the apparatus 1, and a left rear microphone 11₄ located at the left rear side of the apparatus 1. Although in the example of Figure 6 there are four microphones 11, it should be understood that in some embodiments there may be more than or fewer than four microphones 11 and that the microphones may be arranged or located on the apparatus 10 in any suitable manner. Furthermore, although in Figures 1 and 6, the microphones 11 are shown as part of the apparatus 1, it should be understood that, in some embodiments, the microphone array may be physically separate from the apparatus 1. For example, the microphone array can be located on a headset which wirelessly or otherwise passes the audio signals to the audio processing apparatus 20 for processing. In other examples, the microphones may be located at a different location to the audio signal processing apparatus 20 and/or the processing of the captured signals may be carried out at an unspecified time after their capture. For example, the processing may be performed by a web server while the signals may be captured by a user device such as a mobile phone.

Figure 2 is a schematic functional illustration of the audio processing apparatus 20 according to some embodiments. Figure 3 is a flow diagram illustrating various operations which may be performed by the audio processing apparatus as shown in Figure 2.

The apparatus 10 is configured to receive digital representations of audio signals captured by at least two microphones 11. Two microphones may be used, for instance, when the one or more sound source(s) are located approximately within an arc of 180 degrees relative to the array. In examples in which the one or more sound source(s) are located at any location surrounding the microphone array (e.g. in a 360-degree arc), three or more microphones may be more suitable. If it is desired to determine a direction of the audio sources anywhere in three dimensions surrounding the array, at least four microphones may be appropriate.

The digital representations of audio signals captured by the at least two microphones 11 may be received directly from the microphones 11 (in the case of "integrated microphones") or via the ADC 14 in the case of microphones which output an analogue audio signal.

Short Time Fourier Transformation

As illustrated in Figure 2, the audio signal processing apparatus may in some examples comprise a short time Fourier transformer (STFT) 101 for transforming received time domain audio signals into the frequency domain. The digital representations $x_m(t)$ of the audio signals captured by the microphone array are received by the STFT 101. The operation of receiving the microphone input audio signals is shown in Figure 3 by step 201. In the examples described herein, there are one or more audio sources 200 in an audio capture area around the microphone array 11. The microphone array 11 can be considered to capture in the time domain the sound or audio sources which have been convolved with their spatial responses. This can be mathematically modelled or described as:

$$x_m(t) = \sum_{p=1}^{P} \sum_{\tau} h_{mp}(\tau)\, s_p(t - \tau) \qquad \text{(Equation 1)}$$

where: $x_m(t)$ is the mixture of $p = 1 \ldots P$ sources captured by microphones $m = 1 \ldots M$ (in other words, the microphone $m$ receives the audio signal $x_m$);

$t$ denotes the sample index; and

$\tau$ is the index for the convolution with the spatial response $h$.

In this "mixing" model the spatial response from the source $p$ to the microphone $m$ is denoted by $h_{mp}(\tau)$ and the source signals are given as $s_p(t)$.

The STFT 101 is configured to perform a short time Fourier transform on the captured audio signals. The STFT 101 may be configured to calculate the STFT of a time domain signal by dividing the captured audio signals into small overlapping windows, applying the window function and taking the discrete Fourier transform (DFT) of it.

The "mixing" model can be approximated in the STFT-domain as:

p p

%f,n ¾ ^ ' hf,n,p s f,n,p ~ ^ ' y 'f ' ,η,ρ

p=l p=l

Equation 2 where: Xf, n is the short-time Fourier transform (STFT) of the array capture x m and may be denoted as Xf >n = each time frame-frequency point (f,n) of each input channel (m = ι,.,.,Μ),

f= i... is the frequency index,

n = i...Nis the frame index,

Sf,n, P is the denotation of the single-channel STFT of each source p, and h/,η,ρ (which is fixed within each time frame n) is the frequency domain room impulse reverberations (RIRs) of each source and may be expressed as h/,„, p = [hf,n,i,...,hf, n ,M] T , and

y/,η,ρ is the denotation of the STFT of the reverberated source signals.

In words, the STFT of the array capture x / , n may be described as the sum of the STFTs of the reverberated individual source signals y/,π,ρ ·

The operation of transforming the time domain signals into the frequency domain is shown in Figure 3 by step 301. The output of the STFT 101, $\mathbf{x}_{f,n}$ (the frequency domain form of the audio signals), is provided to a discrete spatial distribution calculator 102.
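By way of illustration, a minimal sketch of this transform step using SciPy's STFT is shown below; the window length and hop size are assumed values rather than parameters given in this specification:

```python
import numpy as np
from scipy.signal import stft

def array_stft(x, fs, win_len=1024, hop=512):
    """Compute the STFT of a multichannel capture.

    x  : time-domain capture, shape (M, T) -- one row per microphone
    fs : sampling rate in Hz
    Returns the complex STFT reordered to shape (F, N, M), matching
    the x_{f,n,m} indexing used above, plus the bin frequencies and
    frame times.
    """
    f, t, X = stft(x, fs=fs, nperseg=win_len, noverlap=win_len - hop)
    # SciPy returns shape (M, F, N); reorder to (F, N, M).
    return np.transpose(X, (1, 2, 0)), f, t

# Toy usage: 3 microphones, 1 second of noise at 48 kHz
x = np.random.randn(3, 48000)
X, freqs, times = array_stft(x, fs=48000)
print(X.shape)  # (513, N, 3)
```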

Discrete Spatial Distribution Calculation

The audio signal processing apparatus comprises a discrete spatial distribution calculator 102. The discrete spatial distribution calculator 102 is configured to calculate a discrete spatial distribution of the captured audio signals on the basis of the received frequency domain form of the audio signals. The discrete spatial distribution calculator 102 may be said, in more specific terms, to calculate the observed spatial energy $z_{n,o}$ over all directions $o$ around the microphone array 11. For this we define $o = 1 \ldots O$ as the set of directions around the microphone array. Each individual spatial energy distribution value $z_{n,o}$ for a particular direction $o$ and time frame $n$ may be referred to as a spatial weight for that direction and time frame. In some embodiments, the observed spatial energy $z_{n,o}$ may be calculated using a steered response power (SRP) algorithm with a phase transform (PHAT) weighting (which may be referred to as the Steered Response Power - Phase Transform (SRP-PHAT) algorithm), although in other embodiments different algorithms may be used. The observed spatial energy $z_{n,o}$ may be mathematically modelled or described as:

Equation 3 where: τ 0 (m 1( m 2 ) is the difference between times of arrival of sounds at a pair of microphones (m 1 ,m 2 ).

The discrete spatial distribution calculator 102 may utilise knowledge of the array geometry in order to calculate the time it takes sound to arrive from direction o to microphone m, τ_o(m). This is illustrated in Figure 2 as an input to the discrete spatial distribution calculator.

For the representation of Equation 3, all sound sources may be assumed to be in the far field. Alternatively, the sound sources may be assumed to be at a fixed distance from the center of the microphone array (which may be the center of the electronic apparatus 1 depicted in Figure 1), for instance 2 m.
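A minimal sketch of an SRP-PHAT computation along the lines of Equation 3 is given below. The function name, the array shapes and the sign convention for the time differences of arrival are illustrative assumptions; a table of propagation times tau derived from the (known) array geometry is assumed to be available.

```python
import numpy as np
from itertools import combinations

def srp_phat(X, tau, freqs):
    """Steered response power with PHAT weighting (Equation 3 sketch).

    X     : (M, F, N) complex STFT of the array capture
    tau   : (O, M) propagation time from each direction o to each mic m,
            derived from the array geometry (assumed available)
    freqs : (F,) frequency of each STFT bin in Hz
    Returns z with shape (N, O): spatial energy per frame and direction.
    """
    M, F, N = X.shape
    O = tau.shape[0]
    z = np.zeros((N, O))
    for m1, m2 in combinations(range(M), 2):
        cross = X[m1] * np.conj(X[m2])          # (F, N) cross-spectrum
        phat = cross / (np.abs(cross) + 1e-12)  # keep phase only (PHAT)
        tdoa = tau[:, m1] - tau[:, m2]          # (O,) pairwise TDOA
        steer = np.exp(2j * np.pi * freqs[None, :] * tdoa[:, None])  # (O, F)
        z += np.real(steer @ phat).T            # accumulate over mic pairs
    return z
```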

The operation of calculating the discrete spatial distribution based on the frequency domain signals is shown in Figure 3 by step 302.

The output z_{n,o} of the discrete spatial distribution calculator 102 is provided to a direction-of-arrival estimator 103. The output z_{n,o} of the discrete spatial distribution calculator 102 may be referred to as the signal energy for each direction 1...O and time instant 1...N.

Direction-of-Arrival Estimation

The audio signal processing apparatus further comprises a direction-of-arrival estimator 103. The direction-of-arrival estimator 103 is configured to convert the discrete spatial distribution z_{n,o} into multiple direction-of-arrival (DOA) measurements.

The direction-of-arrival measurements may be made up of associated sets of measurements, each including an estimated mean angle (direction of arrival), an associated variance and an associated weight. The conversion of the discrete spatial distribution z_{n,o} into multiple direction-of-arrival (DOA) measurements may, in some examples, be performed by estimating parameters of a wrapped Gaussian mixture model (WGMM) for each time frame. The wrapped Gaussian mixture model may be defined as:

$$P_w(\theta;\, \mu, \sigma^2, \alpha) = \alpha \sum_{l=-K}^{K} N(\theta + 2\pi l;\, \mu, \sigma^2) \qquad \text{(Equation 4)}$$

where: μ is an estimated mean angle of the Gaussian distribution model;

σ² is the variance associated with the mean angle;

N(θ; μ, σ²) is the probability density function of a regular Gaussian distribution with the same mean and variance;

K is a predefined constant that may typically be between three and five; and α is a weight associated with the mean angle.

The direction-of-arrival estimator 103 may be configured to estimate the parameters/measurements (the mean, variance and weight) of the WGMM by estimating a WGMM for each time frame n of the observed spatial energy z_{n,o}. More specifically, a mixture of k = 1,...,K wrapped Gaussians for each time frame n may be estimated. The mean angle μ_{n,k}, the variance in the angle σ²_{n,k} and the weight α_{n,k} may be considered as permutated measurements and may be obtained as results of the following optimization problem:

$$\{\hat{\mu}_{n,k}, \hat{\sigma}_{n,k}, \hat{\alpha}_{n,k}\} = \underset{\mu_{n,k},\, \sigma_{n,k},\, \alpha_{n,k}}{\operatorname{arg\,min}} \; \sum_{o=1}^{O} \left( z_{n,o} - \sum_{k=1}^{K} P_w\!\left(\theta_o;\, \mu_{n,k}, \sigma_{n,k}^2, \alpha_{n,k}\right) \right)^{\!2} \qquad \text{(Equation 5)}$$

where: θ_o corresponds to the azimuth angle of direction o; and

z_{n,o} is the observed spatial energy as defined in Equation 3.

The problem of Equation 5 is a WGMM parameter estimation in one dimension with an observed discretized distribution. The minimization of Equation 5 is a conventional non-linear least squares problem, and numerous methods for solving it exist in the literature. Such methods include that described in R. H. Byrd, R. B. Schnabel, and G. A. Shultz, "A trust region algorithm for nonlinearly constrained optimization," SIAM Journal on Numerical Analysis, vol. 24, no. 5, pp. 1152-1170, 1987.
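By way of illustration, the least-squares fit of Equation 5 may be sketched as follows using a general-purpose non-linear least squares solver (here scipy.optimize.least_squares, as one of the numerous available methods). The initialisation strategy, the bounds and the function names are illustrative assumptions, not prescribed by the text.

```python
import numpy as np
from scipy.optimize import least_squares

def wgmm_pdf(theta, mu, sigma, alpha, K=3):
    """Wrapped Gaussian (Equation 4): a regular Gaussian replicated at
    2*pi offsets l = -K..K and scaled by the weight alpha."""
    l = np.arange(-K, K + 1)
    d = theta[:, None] - mu + 2 * np.pi * l[None, :]
    return alpha * np.sum(
        np.exp(-0.5 * (d / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi)),
        axis=1)

def fit_wgmm(theta_o, z_n, n_comp=3):
    """Least-squares fit of a mixture of wrapped Gaussians to the observed
    spatial energy of one time frame (the Equation 5 problem)."""
    def residual(params):
        mu, sig, a = np.split(params, 3)
        model = sum(wgmm_pdf(theta_o, mu[k], sig[k], a[k])
                    for k in range(n_comp))
        return model - z_n
    # crude initialisation: evenly spaced means, moderate spread,
    # equal weights (an illustrative choice)
    x0 = np.concatenate([np.linspace(0, 2 * np.pi, n_comp, endpoint=False),
                         np.full(n_comp, 0.5), np.full(n_comp, 0.1)])
    lb = np.concatenate([np.full(n_comp, -np.inf),
                         np.full(n_comp, 1e-3), np.zeros(n_comp)])
    res = least_squares(residual, x0, bounds=(lb, np.inf))
    return np.split(res.x, 3)  # means, standard deviations, weights
```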

As implied above, the WGMM parameters may be referred to as direction-of-arrival measurements. In some examples, the direction of arrival measurements output by the direction of arrival estimator 103 may include the mean and variance of each tracked source for each time instant.

The operation of estimating the direction-of-arrival measurements based on the discrete spatial distribution is shown in Figure 3 by step 303.

Figures 4A to 4C are graphical illustrations of estimated direction-of-arrival measurements based on an arbitrary test signal at various stages of processing. Figure 4A is a graph of the mean angle μ (on the x-axis) against the weighting factor α (on the y-axis) and illustrates an observed (experimentally measured) spatial energy z_{n,o} for a single time frame n together with the three-component (K=3) WGMM estimate of the spatial energy. Figure 4B is a graph of time in seconds (on the x-axis) against the mean angle μ (on the y-axis) for all time frames.

Figure 4C is derived from the graph of Figure 4B but has had all the mean angles μ_{n,k} having a standard deviation σ_{n,k} greater than 1 radian (approximately 57 degrees) and a weight α_{n,k} less than 0.005 filtered out. This serves to remove the "clutter" of false measurements (put another way, to "declutter" the measurements) before a tracking algorithm is applied to the results. This clutter may result from the modelling of the noise in the SRP algorithm. In some examples, the direction-of-arrival estimator 103 may be configured to "declutter" the direction-of-arrival measurements by filtering out measurements having particular parameters which do not satisfy particular criteria. For instance, similarly to as described with reference to Figure 4C, the direction-of-arrival estimator 103 may be configured to remove measurements having a standard deviation σ_{n,k} above a particular threshold value and/or a weight below another threshold value. The thresholds may be determined experimentally.

Since the SRP-PHAT may be scaled to have values in the range [0,1], the weight threshold may be relatively universal, but may depend on the level of background noise, e.g. the SNR of the target to be tracked. Typical values for the weight threshold may be from 0.1 to 0.001. The threshold for the standard deviation may be determined in dependence on, for example, the array geometry (the spatial resolution and how sharp the peaks in SRP-PHAT are). Typical values for the standard deviation threshold may be from 5 degrees to 60 degrees (approximately 0.1 to 1 radians) (in other words, the spatial window defining how "wide" peaks are to be considered as non-clutter measurements).
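The decluttering step itself reduces to simple thresholding, for instance as in the following sketch. The default threshold values mirror those quoted above, but the function name and interface are illustrative assumptions and the thresholds should be tuned to the array geometry and SNR.

```python
import numpy as np

def declutter(mu, sigma, alpha, sigma_max=1.0, alpha_min=0.005):
    """Drop DOA measurements whose standard deviation is too large or
    whose weight is too small. mu, sigma, alpha are arrays of WGMM
    parameters for one or more time frames."""
    keep = (sigma <= sigma_max) & (alpha >= alpha_min)
    return mu[keep], sigma[keep], alpha[keep]
```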

In other examples, the decluttering of the direction-of-arrival measurements may be performed by the acoustical tracker 104, which also forms part of the audio signal processing apparatus 20.

Acoustical Tracking

The acoustical tracker 104 receives the direction-of-arrival measurements from the direction-of-arrival estimator 103. As discussed above, the received direction-of-arrival measurements may be "cluttered" or may have been "decluttered" by the direction-of-arrival estimator 103.

The acoustical tracker 104 is configured to track acoustical sources in each time frame based on the received direction-of-arrival measurements. The acoustical tracking is performed after decluttering of the direction-of-arrival measurements.

The acoustical tracker 104 may be configured to track multiple targets (or sound sources) in the captured audio signal by particle filtering the direction-of-arrival measurements to estimate the direction-of-arrival trajectories of the audio sources and to associate the direction-of-arrival measurements with a particular one of the audio sources p = 1,...,P.

When performing the tracking of the multiple targets, the acoustical tracker 104 converts the wrapped one-dimensional angle measurements μ_{n,k} to a two-dimensional point on a unit circle y_n^(k), where superscript (k) denotes multiple measurements within one time frame n. This may be performed using a rotating vector model such as described in H. Nies, O. Loffeld, and R. Wang, "Phase unwrapping using 2d-kalman filter - potential and limitations," in Geoscience and Remote Sensing Symposium, 2008. IGARSS 2008. IEEE International, vol. 4. IEEE, 2008, pp. IV-1213. The conversion may serve to linearize the measurement model matrix and the state transition. The conversion may be performed using the following expression:

$$\mathbf{y}_n^{(k)} = \begin{bmatrix} \cos(\mu_{n,k}) \\ \sin(\mu_{n,k}) \end{bmatrix} \qquad \text{(Equation 6)}$$

In some examples, a dynamic state-space model may be used by the acoustical tracker 104. In the dynamic state-space model, the state of each source is considered as a 2-D point on a unit circle. In addition, a constant velocity model may be used. In such examples, the underlying state of the dynamical system may be defined by:

$$\mathbf{x}_n^{(p)} = \begin{bmatrix} x & y & \dot{x} & \dot{y} \end{bmatrix}^{T} \qquad \text{(Equation 7)}$$

where: x_n^(p) is the state of each tracked source at a time frame n;

x and y are the x and y coordinates of the tracked source;

ẋ and ẏ are the velocities along the x-axis and y-axis, respectively, of the tracked source; and

superscript (p) is an index of the tracked sources p = 1,...,P.
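For illustration, the constant velocity motion model and the measurement conversion of Equation 6 may be sketched as follows. The frame interval dt, the omission of process noise and the function names are illustrative simplifications; the full tracker described in the cited references wraps this model in a Rao-Blackwellized particle filter.

```python
import numpy as np

def cv_predict(state, dt):
    """Constant-velocity state transition for the state of Equation 7,
    state = [x, y, x_dot, y_dot]. Process noise is omitted here; this
    only illustrates the motion model."""
    A = np.array([[1.0, 0.0, dt,  0.0],
                  [0.0, 1.0, 0.0, dt ],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    return A @ state

def angle_to_point(mu):
    """Measurement conversion of Equation 6: a wrapped angle becomes a
    point on the unit circle."""
    return np.array([np.cos(mu), np.sin(mu)])
```

Converting the angle to a point on the unit circle keeps both the measurement model and the state transition linear in (x, y), which is what allows the Kalman-filter machinery inside the Rao-Blackwellized particle filter to be applied.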

The acoustical tracker 104 may be configured to particle filter the converted measured angles to detect or identify the sound sources present in the observed data. The acoustical tracker 104 then associates particular measured angles with a particular detected sound source. Alternatively, if none of the active source particle distributions (the spatial distribution of an active source) indicates a probability of a current measurement belonging to the active source that is higher than the clutter prior probability, then the measurement is labelled as clutter. The clutter prior probability is a fixed pre-set value intended to determine when (the probability being above the clutter prior threshold) the current measurement is linked to an existing source. A typical value for the clutter prior probability may be, for instance, 0.15.

The acoustical tracker 104 may be configured to track the multiple targets (or sound sources) using Rao-Blackwellized particle filtering, for instance as described in either of S. Särkkä, A. Vehtari, and J. Lampinen, "Rao-Blackwellized particle filter for multiple target tracking" (Information Fusion, vol. 8, no. 1, pp. 2-15, 2007) or "Rao-Blackwellized Monte Carlo data association for multiple target tracking" (Proceedings of the Seventh International Conference on Information Fusion, vol. 1, 2004, pp. 583-590).

The output of the particle filtering is the state of each detected audio source p at each time frame, denoted by x_n^(p).

In some examples, the acoustical tracker 104 may be configured to model internally the estimated state x_n^(p) using Kalman filtering, thereby to provide a reliability measure d_{n,p} for each state estimation x_n^(p).

The acoustical tracker 104 may be further configured to extract the direction-of-arrival from the tracked source state x_n^(p). This may be achieved by calculating the angle (the azimuth angle) of the vector defined by the two-dimensional point. This may be obtained as:

$$\hat{\theta}_{n,p} = \operatorname{atan2}\!\big(x_n(2),\, x_n(1)\big) \qquad \text{(Equation 8)}$$

where: θ̂_{n,p} is the estimated direction of source p at time frame n;

x_n(2) is the y-coordinate of the object location; and

x_n(1) is the x-coordinate of the object location. Although the acoustic tracking has been described above with reference to only two dimensions (based on the assumption that each sound source is located in a single horizontal plane), it will be appreciated that the same equations can easily be extended into three dimensions if it is required to track the audio sources in three, instead of two, dimensions.

The estimated direction θ̂_{n,p} of source p at time frame n may be the output of the acoustical tracker 104. Put another way, the output of the acoustical tracker 104 may be said to be the number and direction of tracked audio sources for each time instant 1...N. In addition, the acoustical tracker 104 may also be configured to output the standard deviation σ_{n,p} associated with the estimated direction θ̂_{n,p}.

The operation of the acoustical tracker 104 in tracking the audio sources in the captured audio signals is shown in Figure 3 by step 304.

The results produced by the acoustical tracker 104 as described above, when using the measurements depicted in Figure 4C, are illustrated graphically in Figure 4D. Specifically, Figure 4D shows the output of the acoustical tracker 104 converted back into the azimuth angle using Equation 8. As can be seen, the acoustical tracker 104 has (correctly) detected two sound sources. However, in Figure 4D, the fact that the two audio sources start from approximately the same position has delayed the detection of the second source. Its detection starts only after a second distinct sound (which was in fact a spoken word) from the second sound source, after approximately 4 seconds.

Spatial Covariance Processing

In some examples, the audio signal processing apparatus 20 further comprises a spatial covariance processing module 105. This may be configured to receive the output from the acoustical tracker 104 and to restore the spatial weights (or spatial energy distribution values) z_{n,o,p} for the detected sound sources and their directions in each time frame. In other examples, however, the spatial weights z_{n,o,p} may be restored by another functional unit/module, for instance but not limited to the acoustical tracking module 104. Restoration of the spatial weights z_{n,o,p} may be performed using the wrapped Gaussian distribution model defined in Equation 4. When restoring the spatial weights z_{n,o,p}, it is assumed that the weighting factor α is equal to 1. Put another way, it is assumed that each source direction at each time frame consists of a single wrapped Gaussian. As such, the spatial weights for the detected sound sources are given by:

$$z_{n,o,p} = P_w\!\left(\theta_o;\, \mu_{n,p}, \sigma_{n,p}^2\right) \qquad \text{(Equation 10)}$$

where: μ_{n,p} is the direction-of-arrival of sound from each detected source at each time frame (extracted from the tracked source state calculated by the acoustical tracker 104); and

σ²_{n,p} is the variance for the extracted direction of arrival for each detected source at each time frame.
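A sketch of this restoration step is given below: it evaluates a single wrapped Gaussian with unit weight (Equation 10) on the grid of azimuth angles. The truncation K and the function name are illustrative assumptions.

```python
import numpy as np

def restore_spatial_weights(theta_o, mu_np, sigma_np, K=3):
    """Spatial weights for one tracked source at one time frame
    (Equation 10 sketch): a single wrapped Gaussian, weight fixed to 1.

    theta_o : (O,) azimuth angle of each direction index o
    mu_np   : tracked mean direction of the source at this frame
    sigma_np: tracked standard deviation at this frame
    """
    l = np.arange(-K, K + 1)
    d = theta_o[:, None] - mu_np + 2 * np.pi * l[None, :]
    return np.sum(np.exp(-0.5 * (d / sigma_np) ** 2)
                  / (sigma_np * np.sqrt(2 * np.pi)), axis=1)
```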

The spatial covariance processing module 105 is configured to perform spatial covariance processing on the spatial weights z_{n,o,p} to produce time-varying source spatial covariance matrices (SCMs) for each time instant 1...N.

The time-varying source SCMs produced by the spatial covariance processing module 105 may be defined as:

$$\mathbf{H}_{f,n,p} = \sum_{o=1}^{O} \mathbf{W}_{f,o}\, z_{n,o,p} \qquad \text{(Equation 11)}$$

where: H_{f,n,p} are the SCMs of the frequency domain time-dependent room impulse responses (RIRs); and

W_{f,o} is a direction of arrival kernel for a particular pair of microphones (m_1, m_2) which are utilised to capture the audio signals.

The direction of arrival kernel W_{f,o} in the expression of Equation 11 may be defined as:

$$\left[\mathbf{W}_{f,o}\right]_{m_1, m_2} = e^{\,j 2\pi f_j\, \tau_{\mathbf{k}_o}(m_1, m_2)} \qquad \text{(Equation 12)}$$

where: f_j is the frequency of the j-th discrete Fourier transform bin; and

τ_{k_o}(m_1, m_2) denotes the time difference of arrival (time delay) between the two microphones caused by a source at direction k_o.

The direction k_o in the above expression is defined as a vector pointing towards a direction parameterized by azimuth θ_o ∈ [0, 2π] and elevation φ_o ∈ [0, π], originating from the geometric center of the microphone array. In a general case, the direction vectors k_o indexed by o would sample the space around the array approximately uniformly. However, in some embodiments, it is assumed that the sources of interest lie approximately on the xy-plane with elevation being zero, such that all the direction vectors k_o differ only by their azimuth. The directional statistics used in the acoustical tracking of the sound sources simplify to a univariate case when the sampling of the spatial space around the array is by azimuthal information only. In other examples, the acoustical tracking described above may be performed along all three axes (x, y, z). In such examples, it is straightforward to define vectors k_o which differ in terms of both their azimuth and elevation.

Based on the known geometry of the microphone array, the time difference of arrival τ_{k_o}(m_1, m_2) in Equation 12 can be obtained relatively easily, for instance as specified in J. Nikunen and T. Virtanen, "Direction of arrival based spatial covariance model for blind sound source separation" (IEEE Transactions on Audio, Speech, and Language Processing, vol. 22, no. 3, pp. 727-739, 2014). The direction of arrival kernel W_{f,o} of Equation 12 simply results from converting the time difference of arrival τ_{k_o}(m_1, m_2) to a phase difference.
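For illustration, computing the kernel for one look direction under a far-field plane-wave model might look as follows. The sign convention, the speed of sound and the function interface are illustrative assumptions rather than values fixed by the text.

```python
import numpy as np

def doa_kernel(mic_pos, k_o, freqs, c=343.0):
    """DOA kernel of Equation 12 for one look direction (sketch).

    mic_pos : (M, 3) microphone coordinates relative to the array centre
    k_o     : (3,) unit vector pointing towards direction o
    freqs   : (F,) DFT bin frequencies in Hz
    Returns W with shape (F, M, M), one phase-difference matrix per bin.
    """
    # far-field plane-wave delay of each mic relative to the array centre
    tau_m = -(mic_pos @ k_o) / c                 # (M,)
    tdoa = tau_m[:, None] - tau_m[None, :]       # (M, M) pairwise TDOA
    # convert each pairwise time delay to a per-frequency phase difference
    return np.exp(2j * np.pi * freqs[:, None, None] * tdoa[None, :, :])
```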

The direction of arrival kernel W_{f,o} may be pre-stored by the spatial covariance processing module 105, or alternatively may be determined as required based on received information defining the geometry of the microphone array which has been used to capture the audio signals.

The operation of the spatial covariance processing module 105 to produce time-varying spatial covariance matrices (SCMs) for each source for each time instant 1...N is illustrated in Figure 3 by operation 305.

The time-varying spatial covariance matrices of the detected sources generated by the spatial covariance processing module 105 may be passed to the parameter estimator 106, which may form part of the audio signal processing apparatus 20.

Parameter Estimation

The parameter estimator 106 is configured to utilize the received time-varying spatial covariance matrices of the detected sources to estimate spectrogram parameters for use by a source separator 107 in separating the microphone array capture into its constituent sources.

The parameter estimation may include estimating a non-negative matrix factorization (NMF) model as the spectrogram model. The NMF model for the sources s_{f,n,p} may be described as:

$$\hat{s}_{f,n,p} = \sum_{q=1}^{Q} b_{q,p}\, t_{f,q}\, v_{q,n}, \qquad b_{q,p},\, t_{f,q},\, v_{q,n} \geq 0 \qquad \text{(Equation 13)}$$

where: t_{f,q}, v_{q,n} and b_{q,p} are the spectrogram parameters;

t_{f,q}, f = 1,...,F represents the magnitude spectrum of one NMF component q for each frequency bin f;

v_{q,n}, n = 1,...,N is the gain of the NMF component in each frame n; and b_{q,p} represents a soft decision of NMF component q belonging to source p.

In the above expression, one NMF component q represents a single spectrally repetitive event from a mixture of audio sources which is observed by the signals captured by the microphone array 11. One audio source is modeled as a sum of multiple NMF components q.
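A sketch of evaluating the model of Equation 13 for all sources at once is given below; the array shapes and the function name are illustrative assumptions.

```python
import numpy as np

def nmf_source_spectrogram(b, t, v):
    """Magnitude spectrogram of each source as a weighted sum of NMF
    components (Equation 13 sketch).

    b : (Q, P) soft assignment of component q to source p
    t : (F, Q) magnitude spectrum of each component
    v : (Q, N) gain of each component in each frame
    Returns s_hat with shape (F, N, P).
    """
    # s_hat[f, n, p] = sum_q b[q, p] * t[f, q] * v[q, n]
    return np.einsum('qp,fq,qn->fnp', b, t, v)
```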

In addition to the received time-varying spatial covariance matrices H_{f,n,p}, the parameter estimator 106 may be configured also to utilize the frequency domain form of the array capture x_{f,n} (as output by the STFT 101) when estimating the spectrogram parameters for each source.

Specifically, the parameter estimator 106 may utilize the magnitude square-rooted version of the frequency domain form of the array capture x_{f,n}, denoted x̃_{f,n} (with entries |x_{f,n,m}|^(1/2) e^(j∠x_{f,n,m})), to produce a spatial covariance matrix X_{f,n} ∈ C^(M×M) for each time-frequency point. This may be calculated as follows:

$$\mathbf{X}_{f,n} = \tilde{\mathbf{x}}_{f,n}\, \tilde{\mathbf{x}}_{f,n}^{H} \qquad \text{(Equation 14)}$$

Substituting the expression in Equation 2 into Equation 14 gives the expression for the mixture SCMs as:

$$\mathbf{X}_{f,n} \approx \sum_{p=1}^{P} s_{f,n,p}\, \mathbf{H}_{f,n,p} \qquad \text{(Equation 15)}$$

Substituting ŝ_{f,n,p} from Equation 13 and H_{f,n,p} from Equation 11 into Equation 15 gives:

$$\mathbf{X}_{f,n} \approx \hat{\mathbf{X}}_{f,n} = \sum_{p=1}^{P} \underbrace{\left( \sum_{o=1}^{O} \mathbf{W}_{f,o}\, z_{n,o,p} \right)}_{\mathbf{H}_{f,n,p}\ (\text{Eq. 11})} \underbrace{\left( \sum_{q=1}^{Q} b_{q,p}\, t_{f,q}\, v_{q,n} \right)}_{\hat{s}_{f,n,p}\ (\text{Eq. 13})} \qquad \text{(Equation 16)}$$

The expression of Equation 16 may be referred to as the complex non-negative matrix factorization (CNMF) model. The parameter estimator 106 may be configured to estimate the spectrogram parameters b_{q,p}, t_{f,q}, v_{q,n} using the derived CNMF model of Equation 16.

The estimation may be an iterative process which serves to iteratively optimize the parameters. In such examples, the optimization may be performed using an assumption that H_{f,n,p} is set externally (based on z_{n,o,p} calculated during the acoustical tracking) and remains constant during the parameter estimation process.

The parameter estimator 106 may be configured to obtain multiplicative updates for estimating the optimal spectrogram parameters in an iterative manner by partial differentiation of the total modelling criterion (or, put another way, the cost function). The multiplicative updates for finding the optimal parameters of the CNMF model of Equation 16 may be performed using an optimization criterion (cost function) of squared Frobenius norm or Itakura-Saito divergence.

The parameter estimator 106 may additionally use auxiliary variables, for instance as described in the expectation maximization algorithm of A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm" (Journal of the Royal Statistical Society, vol. 39, no. 1, pp. 1-38, 1977).

The technique for CNMF parameter estimation with multiplicative updates is proposed and presented in "Multichannel extensions of non-negative matrix factorization with complex-valued data" (IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 5, pp. 971-982, 2013). This document also sets out the entire probabilistic formulation. Use of the proposed technique for CNMF parameter estimation with multiplicative updates is described in J. Nikunen and T. Virtanen, "Direction of arrival based spatial covariance model for blind sound source separation" (IEEE Transactions on Audio, Speech, and Language Processing, vol. 22, no. 3, pp. 727-739, 2014).

The cost function utilized by the parameter estimator 106 in optimizing the spectrogram parameters may in some examples be the squared Frobenius norm. Accordingly, the update equations for the non-negative parameters as may be utilized by the parameter estimator may be described as:

$$b_{q,p} \leftarrow b_{q,p}\, \frac{\sum_{f,n} t_{f,q}\, v_{q,n}\, \operatorname{tr}\!\big(\mathbf{X}_{f,n} \mathbf{H}_{f,n,p}\big)}{\sum_{f,n} t_{f,q}\, v_{q,n}\, \operatorname{tr}\!\big(\hat{\mathbf{X}}_{f,n} \mathbf{H}_{f,n,p}\big)} \qquad \text{(Equation 17)}$$

$$t_{f,q} \leftarrow t_{f,q}\, \frac{\sum_{n,p} b_{q,p}\, v_{q,n}\, \operatorname{tr}\!\big(\mathbf{X}_{f,n} \mathbf{H}_{f,n,p}\big)}{\sum_{n,p} b_{q,p}\, v_{q,n}\, \operatorname{tr}\!\big(\hat{\mathbf{X}}_{f,n} \mathbf{H}_{f,n,p}\big)} \qquad \text{(Equation 18)}$$

$$v_{q,n} \leftarrow v_{q,n}\, \frac{\sum_{f,p} b_{q,p}\, t_{f,q}\, \operatorname{tr}\!\big(\mathbf{X}_{f,n} \mathbf{H}_{f,n,p}\big)}{\sum_{f,p} b_{q,p}\, t_{f,q}\, \operatorname{tr}\!\big(\hat{\mathbf{X}}_{f,n} \mathbf{H}_{f,n,p}\big)} \qquad \text{(Equation 19)}$$

The estimated spectrogram parameters may be output by the parameter estimator 106 to the source separator 107, which may also form part of the audio signal processing apparatus 20. The estimation of the spectrogram parameters is illustrated in Figure 3 as operation 306. The estimation of the spectrogram parameters by the parameter estimator 106 is described in more detail below with respect to the flow chart of Figure 5.

Source Separation

The source separator 107 may be configured to separate the array capture x_m (more specifically the short-time Fourier transform (STFT) of the array capture, x_{f,n}) into individual sources on the basis of the received spectrogram parameters b_{q,p}, t_{f,q}, v_{q,n} (which define the real-valued magnitude spectrogram ŝ_{f,n,p}). This may be performed using a Wiener filter.

In addition to the spectrogram parameters, the source separator 107 may, in some examples, also utilise the SCMs of the frequency domain time-dependent room impulse responses H_{f,n,p}, for instance, as estimated or otherwise defined by the spatial covariance processing module 105.

For example, the source separator 107 may be configured to obtain the reverberated source signals y_{f,n,p} in the frequency domain by using multichannel Wiener filtering, as follows:

$$\mathbf{y}_{f,n,p} = \hat{s}_{f,n,p}\, \mathbf{H}_{f,n,p} \left( \sum_{p'=1}^{P} \hat{s}_{f,n,p'}\, \mathbf{H}_{f,n,p'} \right)^{-1} \mathbf{x}_{f,n} \qquad \text{(Equation 20)}$$

where: ŝ_{f,n,p} is the real-valued magnitude spectrogram defined by the spectrogram parameters as per Equation 13;

H_{f,n,p} represents the SCMs of the frequency domain time-dependent room impulse responses; and

x_{f,n} is the frequency domain transform of the array capture.

In other examples, the source separator 107 may be configured to obtain the reverberated source signals y_{f,n,p} by using plain magnitude-based Wiener filtering, as follows:

$$\mathbf{y}_{f,n,p} = \frac{\hat{s}_{f,n,p}}{\sum_{p'=1}^{P} \hat{s}_{f,n,p'}}\, \mathbf{x}_{f,n} \qquad \text{(Equation 21)}$$

The source separator 107 may output the separated source signals in the frequency domain (i.e. y_{f,n,p}).
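By way of illustration, the plain magnitude-based variant (Equation 21) may be sketched as follows. The array shapes, the small constant guarding against division by zero and the function name are illustrative assumptions.

```python
import numpy as np

def magnitude_wiener(X, s_hat, eps=1e-12):
    """Plain magnitude-based Wiener filter (Equation 21 sketch): scale
    the array capture by each source's share of the summed source
    magnitude spectrograms.

    X     : (M, F, N) STFT of the array capture
    s_hat : (F, N, P) source magnitude spectrograms from Equation 13
    Returns y with shape (P, M, F, N): the separated (still
    reverberant) source images in the frequency domain.
    """
    gain = s_hat / (s_hat.sum(axis=2, keepdims=True) + eps)  # (F, N, P)
    # broadcast each per-source gain over all microphone channels
    return np.einsum('fnp,mfn->pmfn', gain, X)
```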

The operation of the source separator 107 to separate the array capture into individual audio sources is illustrated in Figure 3 by operation 307.

Spatial Synthesis

In some embodiments, the audio signal processing apparatus 20 may comprise a spatial synthesiser 108. The spatial synthesiser 108 may be configured to receive the output of the source separator 107 and to regenerate the source signals. This may be performed using an inverse short-time Fourier transformer 108-1 (iSTFT) for applying an inverse short-time Fourier transform to the frequency domain separated source signals, thereby transforming them back into the time domain. Put another way, the iSTFT may be configured to perform the inverse operation to that performed by the STFT 101.

In other embodiments, the spatial synthesiser 108 may form part of a different device or apparatus to that which analyses and separates the array capture into its constituent sources. The operation of the spatial synthesiser 108 to regenerate the source signals is illustrated in Figure 3 by operation 308.

The output of the spatial synthesiser 108 may be provided for user consumption via a loudspeaker array 33 or a pair of headphones/headset having binaural rendering capabilities.

Iterative Spectrogram Parameter Estimation

As mentioned above with respect to operation 306 of Figure 3 and the spectrogram parameter estimator 106 of Figure 2, estimation of the spectrogram parameters which define the source spectrograms is an iterative process. An example of this iterative estimation process 306 is illustrated in the flow chart of Figure 5.

First, in operation 306-1, the spectrogram parameter estimator 106 initializes the spectrogram parameters b_{q,p}, t_{f,q}, v_{q,n}.

The initial values of the spectrogram parameters are calculated using spatial weights z_{n,o,p} for each detected source which are scaled so that they sum to unity (i.e. Σ_{o=1}^{O} z_{n,o,p} = 1) at each time frame in which the detected source is considered as being active. In this way, during the frames in which a detected source is considered as being active, the equivalence Σ_p ŝ_{f,n,p} = s_{f,n} holds. When the source is considered as being inactive, the spectrogram parameters are calculated using spatial weights which are all set to zero. In order to model the background noise and diffuse sources, an additional background source may be added. The spatial weights of the additional background source may be set to one when Σ_p z_{n,o,p} < threshold and set to zero otherwise. The threshold may be experimentally determined to allow the detected and tracked sources to capture all spatial evidence within ±30 degrees from their estimated mean. With a background modelling strategy such as this, the detected tracked sources have exclusive priority to model all spatial evidence originating around the tracked mean, with the exception of two direction-of-arrival trajectories intersecting, in which case both sources are active at the same direction indices. After the spectrogram parameters are initialized in operation 306-1, the spatial weights z_{n,o,p} calculated in operation 302 are subsequently utilized to optimize the spectrogram parameters.

Firstly, in operation 306-2, X̂_{f,n} is calculated using Equation 16 based on the spectrogram parameters b_{q,p}, t_{f,q}, v_{q,n} as initialized in operation 306-1 and the spatial weights z_{n,o,p} calculated in operation 302. As will be understood from the above description with regard to the parameter estimation, X̂_{f,n} may be referred to as the complex-valued NMF model of the observed spatial covariance matrices. Next, in operation 306-3, a first of the spectrogram parameters (in this example, b_{q,p}, the soft decision of NMF component q belonging to source p) is updated using Equation 17.

After updating the first of the spectrogram parameters, the spectrogram parameter estimator 106 (in operation 306-4) once again calculates X̂_{f,n} using Equation 16. This time, however, the calculation is performed with the first spectrogram parameter having its updated value. The second and third spectrogram parameters (e.g. t_{f,q} and v_{q,n}) still have the values as initialized in operation 306-1, and the spatial weights z_{n,o,p} remain as calculated in operation 302. Subsequently, in operation 306-5, a second of the spectrogram parameters (in this example, t_{f,q}, the magnitude spectrum of one NMF component q for each frequency bin f) is updated using Equation 18.

After updating the second of the spectrogram parameters, the spectrogram parameter estimator 106 (in operation 306-6) once again calculates X̂_{f,n} using Equation 16. This time, however, the calculation is performed with the first and second spectrogram parameters having their updated values. The third spectrogram parameter (e.g. v_{q,n}) still has its value as initialized in operation 306-1, and the spatial weights z_{n,o,p} remain as calculated in operation 302.

Subsequently, in operation 306-7, the third of the spectrogram parameters (in this example, v_{q,n}, the gain of the NMF component in each frame n) is updated using Equation 19.

After updating the third of the spectrogram parameters, the spectrogram parameter estimator 106 (in operation 306-8) repeats operations 306-2 to 306-7 for a predetermined number of iterations. As such, in the first repeat of operation 306-2, the spectrogram parameter estimator 106 once again calculates X̂_{f,n} using Equation 16. This time, however, the calculation is performed with all three spectrogram parameters having their updated values. The spatial weights z_{n,o,p} remain as calculated in operation 302. After the predetermined number of iterations has been performed, the spectrogram parameter estimator 106 proceeds to operation 306-9 and outputs the estimated spectrogram parameters for use by the source separator 107. The number of iterations may be between, for instance, 50 and 1000 and may depend on the duration of the portion of audio signal that is currently being processed.
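The overall iteration of operations 306-2 to 306-8 may be sketched as follows for the squared Frobenius norm criterion. The array shapes, the einsum-based trace computations and the function names are illustrative assumptions; as described above, the SCMs H_{f,n,p} are held fixed throughout the loop.

```python
import numpy as np

def fit_cnmf(X, H, b, t, v, n_iter=200, eps=1e-12):
    """Iterative multiplicative updates (Equations 16-19 sketch).

    X : (F, N, M, M) observed SCMs (Equation 14)
    H : (F, N, P, M, M) fixed source SCMs (Equation 11)
    b : (Q, P), t : (F, Q), v : (Q, N) spectrogram parameters
    The loop structure mirrors the flow chart of Figure 5; the exact
    numerical details are illustrative.
    """
    # tr(X_{f,n} H_{f,n,p}) does not change during the iterations
    trXH = np.real(np.einsum('fnmk,fnpkm->fnp', X, H))

    def tr_model_H():
        # recompute the model X_hat (Equation 16) and tr(X_hat H)
        s_hat = np.einsum('qp,fq,qn->fnp', b, t, v)    # Equation 13
        Xhat = np.einsum('fnp,fnpmk->fnmk', s_hat, H)  # Equation 16
        return np.real(np.einsum('fnmk,fnpkm->fnp', Xhat, H))

    for _ in range(n_iter):
        tr = tr_model_H()                               # operation 306-2
        b *= (np.einsum('fq,qn,fnp->qp', t, v, trXH)
              / (np.einsum('fq,qn,fnp->qp', t, v, tr) + eps))  # Eq. 17
        tr = tr_model_H()                               # operation 306-4
        t *= (np.einsum('qp,qn,fnp->fq', b, v, trXH)
              / (np.einsum('qp,qn,fnp->fq', b, v, tr) + eps))  # Eq. 18
        tr = tr_model_H()                               # operation 306-6
        v *= (np.einsum('qp,fq,fnp->qn', b, t, trXH)
              / (np.einsum('qp,fq,fnp->qn', b, t, tr) + eps))  # Eq. 19
    return b, t, v
```

Recomputing the model between each of the three updates, as in the sketch, mirrors the repeated calculation of X̂_{f,n} in operations 306-2, 306-4 and 306-6.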

Some further details of components and features of the above-described apparatus and systems 1, 10 and alternatives for them will now be described.

The control (or audio signal processing) apparatus 20 comprises processing apparatus 21 communicatively coupled with memory 22. The memory 22 has computer readable instructions 23 stored thereon, which when executed by the processing apparatus 21 cause the processing apparatus 21 to cause performance of various ones of the operations described with reference to Figures 1 to 6. The control apparatus 20 may in some instances be referred to, in general terms, as "apparatus".

The processing apparatus 21 may be of any suitable composition and may include one or more processors 21A of any suitable type or suitable combination of types. For example, the processing apparatus 21 may be a programmable processor that interprets computer program instructions 23 and processes data. The processing apparatus may include plural programmable processors. Alternatively, the processing apparatus 21 may be, for example, programmable hardware with embedded firmware. The processing apparatus 21 may be termed processing means. The processing apparatus 21 may alternatively or additionally include one or more Application Specific Integrated Circuits (ASICs). In some instances, processing apparatus 21 may be referred to as computing apparatus.

The processing apparatus 21 is coupled to the memory (or one or more storage devices) 22 and is operable to read/write data 24 to/from the memory 22. The memory 22 may comprise a single memory unit or a plurality of memory units, upon which the computer readable instructions (or code) 23 are stored. For example, the memory 22 may comprise both volatile memory and non-volatile memory. For example, the computer readable instructions/program code 23 may be stored in the non-volatile memory and may be executed by the processing apparatus 21 using the volatile memory for temporary storage of data 24 or data and instructions. Examples of volatile memory include RAM, DRAM, SDRAM etc. Examples of non-volatile memory include ROM, PROM, EEPROM, flash memory, optical storage, magnetic storage, etc. The memories in general may be referred to as non-transitory computer readable memory media. The term 'memory', in addition to covering memory comprising both non-volatile memory and volatile memory, may also cover one or more volatile memories only, one or more non-volatile memories only, or one or more volatile memories and one or more non-volatile memories. The computer readable instructions/program code 23 may be pre-programmed into the control apparatus 20. Alternatively, the computer readable instructions 23 may arrive at the control apparatus 20 via an electromagnetic carrier signal or may be copied from a physical entity such as a computer program product, a memory device or a record medium such as a CD-ROM or DVD. The computer readable instructions 23 may provide the logic and routines that enable the devices/apparatuses 1, 10, 20 to perform the functionality described above. The combination of computer-readable instructions stored on memory (of any of the types described above) may be referred to as a computer program product.

As will be appreciated, the apparatus/device 1 and/or subsystem 10 described herein may include various hardware components which may not have been shown in the Figures. Similarly, the apparatuses 1, 20 may comprise further optional software components which are not described in this specification since they may not directly interact with embodiments of the invention. Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on memory, or any computer media. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a "memory" or "computer-readable medium" may be any media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer. Reference to, where relevant, "computer-readable storage medium", "computer program product", "tangibly embodied computer program" etc., or a "processor" or "processing apparatus" etc. should be understood to encompass not only computers having differing architectures such as single/multi-processor architectures and sequencers/parallel architectures, but also specialised circuits such as field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), signal processing devices and other devices. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor, or firmware such as the programmable content of a hardware device, whether instructions for a processor or configuration settings for a fixed-function device, gate array, programmable logic device, etc. If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined. Similarly, it will also be appreciated that the flow diagrams of Figures 3 and 5 are examples only and that various operations depicted therein may be omitted, reordered and/or combined.

Similarly, while various aspects and functions described with reference to Figures 2, 3 and 5 herein may have been described and illustrated with reference to discrete modules, units etc., it will be appreciated that this is for illustrative purposes only and it should be understood that this does not imply that these are physically discrete entities.

Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

It is also noted herein that while the above describes various examples, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present invention as defined in the appended claims.