Title:
SOURCE SEPARATION
Document Type and Number:
WIPO Patent Application WO/2021/064204
Kind Code:
A1
Abstract:
The present examples refer to methods, apparatus and techniques for obtaining a plurality of output signals associated with different sources (e.g. audio sources). In one example, it is possible to: combine a first input signal (502, M0), or a processed version thereof, with a delayed and scaled version (5031) of a second input signal (M1), to obtain a first output signal (504, S'0); and combine a second input signal (502, M1), or a processed version thereof, with a delayed and scaled version (5030) of the first input signal (M0), to obtain a second output signal (504, S'1). It is possible to determine, e.g. using a random direction optimization (560), scaling values (564, a0, a1) and delay values (564, d0, d1), which are used to obtain the delayed and scaled versions (5030, 5031) of the first and second input signals.
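The combine step summarized in the abstract can be illustrated with a short sketch. This is not the patent's implementation: the subtraction of a delayed, scaled cross-channel copy, the integer-sample delay and all function names below are simplifying assumptions (the claims also cover fractional delays).

```python
import numpy as np

def delay_and_scale(x, d, a):
    """Return a * x delayed by d samples (integer delay for simplicity)."""
    y = np.zeros_like(x)
    if d < len(x):
        y[d:] = x[:len(x) - d]
    return a * y

def separate(m0, m1, a0, a1, d0, d1):
    """Each output combines one input with a delayed, scaled copy of the
    other input (here by subtraction, to cancel crosstalk)."""
    s0 = m0 - delay_and_scale(m1, d1, a1)  # first output S'0
    s1 = m1 - delay_and_scale(m0, d0, a0)  # second output S'1
    return s0, s1
```

With the true mixing delay and scale, the crosstalk term cancels exactly in this idealized anechoic model; the optimization described below searches for those values.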

Inventors:
SCHULLER GERALD (DE)
Application Number:
PCT/EP2020/077716
Publication Date:
April 08, 2021
Filing Date:
October 02, 2020
Assignee:
FRAUNHOFER GES FORSCHUNG (DE)
UNIV ILMENAU TECH (DE)
International Classes:
G10L21/0272; G06N3/12
Other References:
GOLOKOLENKO OLEG ET AL: "A Fast Stereo Audio Source Separation for Moving Sources", 2019 53RD ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS, AND COMPUTERS, IEEE, 3 November 2019 (2019-11-03), pages 1931 - 1935, XP033750517, DOI: 10.1109/IEEECONF44664.2019.9048652
ALI KHANAGHA ET AL: "Selective tap training of FIR filters for Blind Source Separation of convolutive speech mixtures", INDUSTRIAL ELECTRONICS&APPLICATIONS, 2009. ISIEA 2009. IEEE SYMPOSIUM ON, IEEE, PISCATAWAY, NJ, USA, 4 October 2009 (2009-10-04), pages 248 - 252, XP031582379, ISBN: 978-1-4244-4681-0
BUCHNER H ET AL: "Trinicon: a versatile framework for multichannel blind signal processing", ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2004. PROCEEDINGS. (ICASSP ' 04). IEEE INTERNATIONAL CONFERENCE ON MONTREAL, QUEBEC, CANADA 17-21 MAY 2004, PISCATAWAY, NJ, USA,IEEE, PISCATAWAY, NJ, USA, vol. 3, 17 May 2004 (2004-05-17), pages 889 - 892, XP010718333, ISBN: 978-0-7803-8484-2, DOI: 10.1109/ICASSP.2004.1326688
A. HYVARINEN, J. KARHUNEN, E. OJA: "Microphone arrays, signal processing techniques and applications", 2001, JOHN WILEY & SONS
G. EVANGELISTA, S. MARCHAND, M. D. PLUMBLEY, E. VINCENT: "DAFX: Digital Audio Effects", 2011, JOHN WILEY AND SONS, article "Sound source separation"
J. TARIQULLAH, W. WANG, D. WANG: "A multistage approach to blind separation of convolutive speech mixtures", SPEECH COMMUNICATION, vol. 53, 2011, pages 524 - 539
J. BENESTY, J. CHEN, E. A. HABETS: "Speech enhancement in the STFT domain", 2012, SPRINGER
J. JANSKY, Z. KOLDOVSKY, N. ONO: "A computationally cheaper method for blind speech separation based on AuxIVA and incomplete demixing transform", IEEE INTERNATIONAL WORKSHOP ON ACOUSTIC SIGNAL ENHANCEMENT (IWAENC), 2016
D. KITAMURA, N. ONO, H. SAWADA, H. KAMEOKA, H. SARUWATARI: "Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization", IEEE/ACM TRANS. ASLP, vol. 24, no. 9, 2016, pages 1626 - 1641
"Determined blind source separation with independent low-rank matrix analysis", 2018, SPRINGER, pages: 31
H.-C. WU, J. C. PRINCIPE: "Simultaneous diagonalization in the frequency domain (SDIF) for source separation", PROC. ICA, 1999, pages 245 - 250
H. SAWADA, N. ONO, H. KAMEOKA, D. KITAMURA: "Blind audio source separation on tensor representation", ICASSP, April 2018 (2018-04-01)
J. HARRIS, S. M. NAQVI, J. A. CHAMBERS, C. JUTTEN: "Realtime independent vector analysis with student's t source prior for convolutive speech mixtures", IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, ICASSP, April 2015 (2015-04-01)
H. BUCHNER, R. AICHNER, W. KELLERMANN: "Trinicon: A versatile framework for multichannel blind signal processing", IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2004
J. CHUA, G. WANG, W. B. KLEIJN: "Convolutive blind source separation with low latency", ACOUSTIC SIGNAL ENHANCEMENT (IWAENC), IEEE INTERNATIONAL WORKSHOP, 2016, pages 1 - 5, XP032983095, DOI: 10.1109/IWAENC.2016.7602895
W. KLEIJN, K. CHUA: "Non-iterative impulse response shortening method for system latency reduction", ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, pages 581 - 585, XP033258484, DOI: 10.1109/ICASSP.2017.7952222
I. SELESNICK: "Low-pass filters realizable as all-pass sums: design via a new flat delay filter", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: ANALOG AND DIGITAL SIGNAL PROCESSING, vol. 46, 1999, XP011012972
T. I. LAAKSO, V. VALIMAKI, M. KARJALAINEN, U. K. LAINE: "Splitting the unit delay", IEEE SIGNAL PROCESSING MAGAZINE, January 1996 (1996-01-01)
BEAMFORMING, 21 April 2019 (2019-04-21), Retrieved from the Internet
BENESTY, SONDHI, HUANG: "Handbook of speech processing", 2008, SPRINGER
J. P. THIRAN: "Recursive digital filters with maximally flat group delay", IEEE TRANS. ON CIRCUIT THEORY, vol. 18, no. 6, November 1971 (1971-11-01), pages 659 - 664
S. DAS, P. N. SUGANTHAN: "Differential evolution: A survey of the state-of-the-art", IEEE TRANS. ON EVOLUTIONARY COMPUTATION, vol. 15, no. 1, February 2011 (2011-02-01), pages 4 - 31
R. STORN, K. PRICE: "Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces", JOURNAL OF GLOBAL OPTIMIZATION, vol. 11, no. 4, 1997, pages 341 - 359, XP008124915
DIFFERENTIAL EVOLUTION, 21 April 2019 (2019-04-21), Retrieved from the Internet
J. GAROFOLO ET AL., TIMIT ACOUSTIC-PHONETIC CONTINUOUS SPEECH CORPUS, 1993
MICROPHONE ARRAY SPEECH PROCESSING, 29 July 2019 (2019-07-29), Retrieved from the Internet
LLRMA, 29 July 2019 (2019-07-29), Retrieved from the Internet
R. B. STEPHENS, A. E. BATE, ACOUSTICS AND VIBRATIONAL PHYSICS, 1966
J. B. ALLEN, D. A. BERKLEY: "Image method for efficiently simulating small room acoustics", J. ACOUST. SOC. AMER., vol. 65, 1979
C. FEVOTTE, R. GRIBONVAL, E. VINCENT: "BSS Eval toolbox user guide", TECH. REP. 1706, IRISA TECHNICAL REPORT 1706, 2005
E. G. LEARNED-MILLER: "Entropy and Mutual Information", 2013, DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF MASSACHUSETTS
COMPARISON OF BLIND SOURCE SEPARATION TECHNIQUES, 29 July 2019 (2019-07-29), Retrieved from the Internet
Attorney, Agent or Firm:
BURGER, Markus et al. (DE)
Claims

1. An apparatus (500) for obtaining a plurality of output signals (504, S'0, S'1), associated with different sound sources (source0, source1), on the basis of a plurality of input signals (502, M0, M1), in which signals (S0, S1) from the sound sources (source0, source1) are combined (501),
wherein the apparatus is configured to combine (510) a first input signal (502, M0), or a processed version thereof, with a delayed and scaled version (5031) of a second input signal (M1), to obtain a first output signal (504, S'0);
wherein the apparatus is configured to combine (510) a second input signal (502, M1), or a processed version thereof, with a delayed and scaled version (5030) of the first input signal (M0), to obtain a second output signal (504, S'1);
wherein the apparatus is configured to determine, using a random direction optimization (560):
a first scaling value (564, a0), which is used to obtain the delayed and scaled version (5030) of the first input signal (502, M0);
a first delay value (564, d0), which is used to obtain the delayed and scaled version (5030) of the first input signal (502, M0);
a second scaling value (564, a1), which is used to obtain the delayed and scaled version (5031) of the second input signal (502, M1); and
a second delay value (564, d1), which is used to obtain the delayed and scaled version (5031) of the second input signal.

2. The apparatus of claim 1, wherein the delayed and scaled version (5031) of the second input signal (502, M1), to be combined with the first input signal (502, M0), is obtained by applying a fractional delay to the second input signal (502, M1).

3. The apparatus of claim 1 or 2, wherein the delayed and scaled version (5030) of the first input signal (502, M0), to be combined with the second input signal (502, M1), is obtained by applying a fractional delay to the first input signal (502, M0).
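The fractional delays recited in claims 2 and 3 are delays of a non-integer number of samples. One standard way to realize them (the cited Laakso et al. reference on splitting the unit delay surveys such filters) is a windowed-sinc FIR filter; the tap count and Hamming window below are illustrative choices, not mandated by the patent.

```python
import numpy as np

def fractional_delay_fir(d, num_taps=21):
    """Windowed-sinc FIR approximating a (possibly non-integer) delay of
    d samples, on top of a bulk delay of (num_taps - 1) / 2 samples."""
    n = np.arange(num_taps)
    center = (num_taps - 1) / 2
    h = np.sinc(n - center - d)   # shifted sinc = ideal delay response
    h *= np.hamming(num_taps)     # taper to reduce ripple
    return h / h.sum()            # normalize DC gain to 1

def apply_fractional_delay(x, d):
    h = fractional_delay_fir(d)
    return np.convolve(x, h)[:len(x)]
```

For low-frequency content the output closely matches the analytically shifted signal, once the bulk delay of the filter is accounted for.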

4. The apparatus of any of the preceding claims, configured to sum (714, 716) a plurality of products (712', 710') between:
- a respective element (P0) of a first set of normalized magnitude values, and
- a logarithm (706') of a quotient (702') formed on the basis of:
  - the respective element (P0) of the first set of normalized magnitude values (522); and
  - a respective element (P1) of a second set of normalized magnitude values (522),
in order to obtain (530) a value (DKL, D, 532) describing a similarity, or dissimilarity, between a signal portion (s'0) described by the first set of normalized magnitude values (P0) and a signal portion (s'1) described by the second set of normalized magnitude values (P1).

5. The apparatus of any of the preceding claims, wherein the random direction optimization (560) is such that candidate parameters form a candidates' vector (564, a0, a1, d0, d1), wherein the candidates' vector is iteratively refined by modifying the candidates' vector in random directions.

6. The apparatus of any of the preceding claims, wherein the random direction optimization (560) is such that candidate parameters form a candidates' vector, wherein the candidates' vector is iteratively refined by modifying the candidates' vector in random directions.

7. The apparatus of any of the preceding claims, wherein the random direction optimization (560) is such that a metric and/or a value (DKL, D, 532) indicating the similarity, or dissimilarity, between the first and second output signals is measured, and the first and second output signals are selected to be those associated with the candidate parameters whose value or metric indicates the lowest similarity, or highest dissimilarity.
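The random direction optimization recited here can be sketched generically: perturb the best-so-far candidate vector (e.g. (a0, a1, d0, d1)) in a random direction and keep the perturbation only when the objective decreases. The step size, iteration count and Gaussian perturbation below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def random_direction_minimize(objective, x0, step=0.1, iters=2000, seed=1):
    """Random-direction search: perturb the current best candidate vector
    in a random direction and keep the perturbed vector only if it
    lowers the objective function."""
    rng = np.random.default_rng(seed)
    best = np.asarray(x0, dtype=float)
    best_val = objective(best)
    for _ in range(iters):
        candidate = best + step * rng.standard_normal(best.shape)
        val = objective(candidate)
        if val < best_val:  # adopt as the new current best candidate
            best, best_val = candidate, val
    return best, best_val
```

In the separation context, `objective` would evaluate the (dis)similarity metric of the two candidate output signals; here any callable works.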

8. An apparatus (500) for obtaining a plurality of output signals (504, S'0, S'1), associated with different sound sources (source1, source2), on the basis of a plurality of input signals (502, M0, M1), in which signals from the sound sources (source1, source2) are combined (501),
wherein the apparatus is configured to combine (510) a first input signal (502, M0), or a processed version thereof, with a delayed and scaled version (5031) of a second input signal (502, M1), to obtain a first output signal (504, S'0), wherein the apparatus is configured to apply a fractional delay (d1) to the second input signal (502, M1);
wherein the apparatus is configured to combine (510) a second input signal (502, M1), or a processed version thereof, with a delayed and scaled version (5030) of the first input signal (502, M0), to obtain a second output signal (504, S'1), wherein the apparatus is configured to apply a fractional delay (d0) to the first input signal (502, M0);
wherein the apparatus is configured to determine, using an optimization (560):
a first scaling value (564, a0), which is used to obtain the delayed and scaled version (5030) of the first input signal (502, M0);
a first fractional delay value (564, d0), which is used to obtain the delayed and scaled version (5030) of the first input signal (502, M0);
a second scaling value (564, a1), which is used to obtain the delayed and scaled version (5031) of the second input signal (502, M1); and
a second fractional delay value (564, d1), which is used to obtain the delayed and scaled version (5031) of the second input signal (502, M1).

9. The apparatus according to claim 8, wherein the optimization is a random direction optimization (560).

10. The apparatus of any of claims 8 or 9, configured to sum (714, 716) a plurality of products (710, 712) between:
- a respective element (P0) of a first set of normalized magnitude values, and
- a logarithm (706') of a quotient (702') formed on the basis of:
  - the respective element (P0) of the first set of normalized magnitude values; and
  - a respective element (P1) of a second set of normalized magnitude values,
in order to obtain (530) a value (DKL, D, 532) describing a similarity, or dissimilarity, between a signal portion (s'0) described by the first set of normalized magnitude values (P0) and a signal portion (s'1) described by the second set of normalized magnitude values (P1).

11. An apparatus (500) for obtaining a plurality of output signals (504, S'0, S'1) associated with different sound sources (source0, source1) on the basis of a plurality of input signals (M0, M1), in which signals from the sound sources are combined,
wherein the apparatus is configured to combine (510) a first input signal (502, M0), or a processed version thereof, with a delayed and scaled version (5031) of a second input signal (502, M1), to obtain a first output signal (504, S'0),
wherein the apparatus is configured to combine (510) a second input signal (502, M1), or a processed version thereof, with a delayed and scaled version (5030) of the first input signal (502, M0), to obtain a second output signal (504, S'1),
wherein the apparatus is configured to sum (714, 716) a plurality of products between:
- a respective element (P0) of a first set of normalized magnitude values, and
- a logarithm (706') of a quotient (702') formed on the basis of:
  - the respective element (P0) of the first set of normalized magnitude values; and
  - a respective element (P1) of a second set of normalized magnitude values,
in order to obtain (530) a value (DKL, D, 532) describing a similarity, or dissimilarity, between a signal portion (s'0) described by the first set of normalized magnitude values (P0) and a signal portion (s'1) described by the second set of normalized magnitude values (P1).

12. The apparatus of claim 11, configured to determine, using an optimization, at least one of:
a first scaling value (564, a0), which is used to obtain the delayed and scaled version of the first input signal (502, M0);
a first delay value (564, d0), which is used to obtain the delayed and scaled version of the first input signal;
a second scaling value (564, a1), which is used to obtain the delayed and scaled version of the second input signal; and
a second delay value (564, d1), which is used to obtain the delayed and scaled version of the second input signal.

13. The apparatus of claim 12, wherein the first delay value (d0) is a fractional delay.

14. The apparatus of any of claims 11 to 13, wherein the second delay value (d1) is a fractional delay.

15. The apparatus of any of claims 12 to 14, wherein the optimization is a random direction optimization (560).

16. The apparatus of any of the preceding claims, wherein at least one of the first and second scaling values and the first and second delay values is obtained by minimizing the mutual information, or a related measure, of the output signals.

17. The apparatus of any of the preceding claims, further comprising an optimizer (560) for iteratively performing the optimization, wherein the optimizer is configured, at each iteration, to randomly generate a current candidate vector and to evaluate whether the current candidate vector performs better than a current best candidate vector, wherein the optimizer is configured to evaluate an objective function, associated with a similarity, or dissimilarity, between physical signals, for the current candidate vector, and wherein the optimizer is configured, in case the current candidate vector causes the objective function to be reduced with respect to the current best candidate vector, to adopt the current candidate vector as the new current best candidate vector.

18. The apparatus of any of the preceding claims, configured to: combine the first input signal (502, M0), or a processed version thereof, with the delayed and scaled version (5031) of the second input signal (502, M1) in the time domain and/or in the z-transform or frequency domain; and combine the second input signal (502, M1), or a processed version thereof, with the delayed and scaled version (5030) of the first input signal (502, M0) in the time domain and/or in the z-transform or frequency domain.

19. The apparatus of any of the preceding claims, wherein the optimization is performed in the time domain and/or in the z-transform or frequency domain.

20. The apparatus of any of the preceding claims, wherein the delay or fractional delay (d0) applied to the second input signal (502, M1) is indicative of the relationship and/or difference of arrival between: the signal from the first source (source0) received by the first microphone (mic0); and the signal from the first source (source0) received by the second microphone (mic1).

21. The apparatus of any of the preceding claims, wherein the delay or fractional delay (d1) applied to the first input signal (502, M0) is indicative of the relationship and/or difference of arrival between: the signal from the second source (source1) received by the second microphone (mic1); and the signal from the second source (source1) received by the first microphone (mic0).

22. The apparatus of any of the preceding claims, configured to perform an optimization (560) such that different candidate parameters (a0, a1, d0, d1) are iteratively chosen and processed, and a metric (532) is measured for each of the candidate parameters, wherein the metric (532) is a similarity metric, or dissimilarity metric, so as to process and combine the first input signal (502, M0) and the second input signal (502, M1) by using the candidate parameters (a0, a1, d0, d1) associated with the metric indicating the lowest similarity, or largest dissimilarity, of the output signals.

23. The apparatus of claim 22, wherein, for each iteration, the candidate parameters include a candidate delay (d0) to be applied to the second input signal (502, M1), the candidate delay (d0) being associable to a candidate relationship and/or candidate difference of arrival between: the signal from the first source (source0) received by the first microphone (mic0); and the signal from the first source (source0) received by the second microphone (mic1).

24. The apparatus of claim 22 or 23, wherein, for each iteration, the candidate parameters include a candidate delay (d1) to be applied to the first input signal (502, M0), the candidate delay (d1) being associable to a candidate relationship and/or candidate difference of arrival between: the signal from the second source (source1) received by the second microphone (mic1); and the signal from the second source (source1) received by the first microphone (mic0).

25. The apparatus of claim 22 or 23 or 24, wherein, for each iteration, the candidate parameters include a candidate relative attenuation value (564, a0) to be applied to the second input signal (502, M1), the candidate relative attenuation value (564, a0) being indicative of a candidate relationship and/or candidate difference between: the amplitude of the signal received by the first microphone (mic0) from the first source (source0); and the amplitude of the signal received by the second microphone (mic1) from the first source (source0).

26. The apparatus of claim 22 or 23 or 24 or 25, wherein, for each iteration, the candidate parameters (564) include a candidate relative attenuation value (a1) to be applied to the first input signal (502, M0), the candidate relative attenuation value (a1) being indicative of a candidate relationship and/or candidate difference between: the amplitude of the signal received by the second microphone (mic1) from the second source (source1); and the amplitude of the signal received by the first microphone (mic0) from the second source (source1).

27. The apparatus of any of claims 22 to 26, configured to change at least one candidate parameter for different iterations.

28. The apparatus of any of claims 22 to 27, configured to change at least one candidate parameter for different iterations by randomly choosing at least one step from at least one candidate parameter for a preceding iteration to at least one candidate parameter for a subsequent iteration.

29. The apparatus of claim 28, configured to choose the at least one step randomly.

30. The apparatus of claim 29, wherein at least one step is weighted by a preselected weight.

31. The apparatus of claim 29 or 30, wherein the at least one step is limited by a preselected weight.

32. The apparatus of any of claims 22 to 31, wherein candidate parameters (a0, a1, d0, d1) form a candidates' vector, wherein, for each iteration, the candidates' vector is perturbed by applying a vector of random numbers, which are element-wise multiplied by, or added to, the elements of the candidates' vector.

33. The apparatus of claim 32, wherein, for each iteration, the candidates' vector is modified by a step.

34. The apparatus of any of claims 22 to 33, wherein the number of iterations is limited to a predetermined maximum number.

35. The apparatus of any of claims 7 and 22 to 34, wherein the metric (532) is computed as a Kullback-Leibler divergence.

36. The apparatus of any of claims 7 and 22 to 35, wherein the metric (532) is computed using a Kullback-Leibler divergence.

37. The apparatus of any of claims 7 and 22 to 36, wherein the metric is such that, the less the crosstalk, the higher its value.

38. The apparatus of any of claims 22 to 37, wherein the metric (532) is based on, for each of the first and second signals (M0, M1), a respective element of a set of normalized magnitude values.

39. The apparatus of claim 38, wherein, for at least one of the first and second input signals (M0, M1), the respective element is based on the candidate first or second output signal (S'0, S'1) as obtained from the candidate parameters.

40. The apparatus of claim 39, wherein, for at least one of the first and second input signals (M0, M1), the respective element is obtained as a fraction between: a value associated with a candidate first or second output signal (S'0, S'1); and a norm associated with the previously obtained values of the first or second output signal.

41. The apparatus of claim 39 or 40, wherein, for at least one of the first and second input signals (M0, M1), the respective element is obtained by

42. The apparatus of claim 41, wherein Pi(n) is comprised between 0 and 1.
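Claims 38-42 operate on normalized magnitude values lying between 0 and 1. One plausible reading (an assumption on our part, since the claim's formula did not survive extraction) is to divide each magnitude in a frame by the L1 norm of the frame, yielding a probability-like distribution:

```python
import numpy as np

def normalized_magnitudes(frame, eps=1e-12):
    """Map a signal frame to normalized magnitude values P(n) in [0, 1]
    that sum to 1 (one plausible reading of the claimed normalization)."""
    mags = np.abs(np.asarray(frame, dtype=float))
    return mags / (mags.sum() + eps)
```

Such a distribution is a valid input to the Kullback-Leibler divergence used as the metric.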

43. The apparatus of any of claims 22 to 42, wherein the metric includes a logarithm of a quotient formed on the basis of: the respective element of the first set of normalized magnitude values; and a respective element of a second set of normalized magnitude values, in order to obtain (530) a value (532) describing a similarity, or dissimilarity, between a signal portion (s'0) described by the first set of normalized magnitude values (P0) and a signal portion (s'1) described by the second set of normalized magnitude values (P1).

44. The apparatus of any of claims 7 or 22 to 43, wherein the metric is obtained in the form of: wherein P(n) is an element associated with the first input signal and Q(n) is an element associated with the second input signal.

45. The apparatus of any of claims 22 to 43, wherein the metric is obtained in the form of: wherein P1(n) is an element associated with the first input signal and P2(n) is an element associated with the second input signal.

46. The apparatus of any of claims 22 to 45, configured to perform the optimization using a sliding window.

47. The apparatus of any of the preceding claims, further configured to transform, into a frequency domain, information associated with the obtained first and second output signals (S'0, S'1).

48. The apparatus of any of the preceding claims, further configured to encode information associated with the obtained first and second output signals (S'0, S'1).

49. The apparatus of any of the preceding claims, further configured to store information associated with the obtained first and second output signals (S'0, S'1).

50. The apparatus of any of the preceding claims, further configured to transmit information associated with the obtained first and second output signals (S'0, S'1).

51. The apparatus of any of the preceding claims, further comprising at least one of a first microphone (mic0) for obtaining the first input signal (502, M0) and a second microphone (mic1) for obtaining the second input signal (502, M1).

52. An apparatus for teleconferencing including the apparatus of any of the preceding claims and equipment for transmitting information associated with the obtained first and second output signals (S'0, S'1).

53. A binaural system including the apparatus of any of the preceding claims.

54. An optimizer (560) for iteratively optimizing physical parameters associated with physical signals, wherein the optimizer is configured, at each iteration, to randomly generate a current candidate vector and to evaluate whether the current candidate vector performs better than a current best candidate vector, wherein the optimizer is configured to evaluate an objective function, associated with a similarity, or dissimilarity, between physical signals, for the current candidate vector, and wherein the optimizer is configured, in case the current candidate vector causes the objective function to be reduced with respect to the current best candidate vector, to adopt the current candidate vector as the new current best candidate vector.

55. The optimizer of claim 54, wherein the physical signals include audio signals obtained by different microphones.

56. The optimizer of claim 54 or 55, wherein the parameters include a delay and/or a scaling factor for an audio signal obtained at a particular microphone.

57. The optimizer of any of claims 54 to 56, wherein the objective function is a Kullback-Leibler divergence.

58. The optimizer of claim 57, wherein the Kullback-Leibler divergence is applied to first and second sets of normalized magnitude values.

59. The optimizer of claim 57 or 58, wherein the objective function is obtained by summing (714, 716) a plurality of products (712', 710') between:
- a respective element (P0) of a first set of normalized magnitude values, and
- a logarithm (706') of a quotient (702') formed on the basis of:
  - the respective element (P0) of the first set of normalized magnitude values (522); and
  - a respective element (P1) of a second set of normalized magnitude values (522),
in order to obtain (530) a value (DKL, D, 532) describing a similarity, or dissimilarity, between a signal portion (s'0) described by the first set of normalized magnitude values (P0) and a signal portion (s'1) described by the second set of normalized magnitude values (P1).

60. The optimizer of claim 57 or 58 or 59, wherein the objective function is obtained as

D(P0, P1) = sum over n of [ P0(n) * log2( P0(n) / P1(n) ) + P1(n) * log2( P1(n) / P0(n) ) ]    (8)

wherein P0(n) or P(n) is an element associated with the first input signal and P1(n) or Q(n) is an element associated with the second input signal.
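A symmetrized Kullback-Leibler divergence of this kind can be written compactly; the base-2 logarithm and the small epsilon guarding against zero bins are assumptions of this sketch, not details taken from the patent.

```python
import numpy as np

def symmetric_kl(p0, p1, eps=1e-12):
    """Symmetrized Kullback-Leibler divergence between two normalized
    magnitude distributions: sum of P0*log2(P0/P1) + P1*log2(P1/P0)."""
    p0 = np.asarray(p0, dtype=float) + eps
    p1 = np.asarray(p1, dtype=float) + eps
    return float(np.sum(p0 * np.log2(p0 / p1) + p1 * np.log2(p1 / p0)))
```

The value is zero for identical distributions and grows with their dissimilarity, which is why the optimization maximizes it to reduce crosstalk between the outputs.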

61. A method for obtaining a plurality of output signals (504, S'0, S'1) associated with different sound sources (source0, source1) on the basis of a plurality of input signals (502, M0, M1), in which signals from the sound sources (source0, source1) are combined, the method comprising:
combining a first input signal (502, M0), or a processed version thereof, with a delayed and scaled version (5031) of a second input signal (502, M1), to obtain a first output signal (504, S'0);
combining a second input signal (502, M1), or a processed version thereof, with a delayed and scaled version (5030) of the first input signal (502, M0), to obtain a second output signal (504, S'1); and
determining, using a random direction optimization (560), at least one of:
a first scaling value (564, a0), which is used to obtain the delayed and scaled version (5030) of the first input signal (502, M0);
a first delay value (564, d0), which is used to obtain the delayed and scaled version (5030) of the first input signal (502, M0);
a second scaling value (564, a1), which is used to obtain the delayed and scaled version (5031) of the second input signal (502, M1); and
a second delay value (564, d1), which is used to obtain the delayed and scaled version (5031) of the second input signal.

62. A method for obtaining a plurality of output signals (504, S'0, S'1), associated with different sound sources (source1, source2), on the basis of a plurality of input signals (502, M0, M1), in which signals from the sound sources (source1, source2) are combined, the method including:
combining (510) a first input signal (502, M0), or a processed version thereof, with a delayed and scaled version (5031) of a second input signal (502, M1), to obtain a first output signal (504, S'0), wherein a fractional delay (d1) is applied to the second input signal (502, M1);
combining (510) a second input signal (502, M1), or a processed version thereof, with a delayed and scaled version (5030) of the first input signal (502, M0), to obtain a second output signal (504, S'1), wherein a fractional delay (d0) is applied to the first input signal (502, M0); and
determining, using an optimization, at least one of:
a first scaling value (564, a0), which is used to obtain the delayed and scaled version (5030) of the first input signal (502, M0);
a first fractional delay value (564, d0), which is used to obtain the delayed and scaled version (5030) of the first input signal (502, M0);
a second scaling value (564, a1), which is used to obtain the delayed and scaled version (5031) of the second input signal (502, M1); and
a second fractional delay value (564, d1), which is used to obtain the delayed and scaled version (5031) of the second input signal (502, M1).

63. A method for obtaining a plurality of output signals (504, S'0, S'1) associated with different sound sources (source0, source1) on the basis of a plurality of input signals (M0, M1), in which signals from the sound sources are combined, the method including:
combining a first input signal (502, M0), or a processed version thereof, with a delayed and scaled version (5031) of a second input signal (502, M1), to obtain a first output signal (504, S'0);
combining a second input signal (502, M1), or a processed version thereof, with a delayed and scaled version (5030) of the first input signal (502, M0), to obtain a second output signal (504, S'1); and
summing (714, 716) a plurality of products between:
- a respective element (P0) of a first set of normalized magnitude values, and
- a logarithm (706') of a quotient (702') formed on the basis of:
  - the respective element (P0) of the first set of normalized magnitude values; and
  - a respective element (P1, Q) of a second set of normalized magnitude values,
in order to obtain (530) a value (532) describing a similarity, or dissimilarity, between a signal portion (s'0) described by the first set of normalized magnitude values (P0) and a signal portion (s'1(n)) described by the second set of normalized magnitude values (P1).

64. A method according to any of claims 61-63, using equipment of any of the preceding apparatus claims.

65. A method according to any of claims 61-64, wherein the fractional delay (d1) is indicative of the relationship and/or difference between the delay of the signal (M0) arriving at the first microphone (mic0) from the second source (source1) and the delay (H1,1) of the signal (M1) arriving at the second microphone (mic1) from the second source (source1).

66. An optimizing method for iteratively optimizing physical parameters associated with physical signals, wherein the method includes, for each iteration, randomly generating a current candidate vector and evaluating whether the current candidate vector performs better than a current best candidate vector, wherein an objective function, associated with a similarity, or dissimilarity, between physical signals, is evaluated for the current candidate vector, and wherein, in case the current candidate vector causes the objective function to be reduced with respect to the current best candidate vector, the current candidate vector is adopted as the new current best candidate vector.

67. A non-transitory storage unit storing instructions which, when executed by a processor, cause the processor to perform a method according to any of claims 61-66.

Description:
Source Separation

Technical field

The present examples refer to methods and apparatus for obtaining a plurality of output signals associated with different sources (e.g. audio sources). The present examples also refer to methods and apparatus for signal separation. The present examples also refer to methods and apparatus for teleconferencing. Techniques for separation (e.g., audio source separation) are also disclosed. Techniques for fast time domain stereo audio source separation (e.g. using fractional delay filters) are also discussed.

Introductory discussion

Fig. 1 shows the setup of microphones indicated with 50a. The microphones 50a may include a first microphone mic0 and a second microphone mic1, which are here shown at a distance of 5 cm (50 mm) from each other. Other distances are possible. Two different sources (source0 and source1) are shown. As identified by the angles b0 and b1, they are placed in different positions (here also in different orientations with respect to each other).

A plurality of input signals M0 and M1 (from the microphones, also collectively indicated as a multi-channel, or stereo, input signal 502) are obtained from the sound sources source0 and source1. While source0 generates the audio sound indexed as S0, source1 generates the audio sound indexed as S1.

The microphone signals M0 and M1 may be considered, for example, as input signals. It is also possible to consider a multi-channel signal with more than two channels instead of the stereo signal 502.

The input signals may be more than two in some examples (e.g. additional input channels besides M0 and M1), even though here mainly two channels are discussed. Notwithstanding, the present examples are valid for any multi-channel input signal. In examples, it is also not necessary that the signals M0 and M1 are directly obtained by a microphone, since they may be obtained, for example, from a stored audio file.

Figs. 2a and 4 show the interactions between the sources source0 and source1 and the microphones mic0 and mic1. For example, source0 generates an audio sound S0, which primarily reaches the microphone mic0 but also reaches the microphone mic1. The same applies to source1, whose generated audio sound S1 primarily reaches the microphone mic1 but also reaches the microphone mic0. We see from Figs. 2a and 4 that the sound S0 needs less time to reach the microphone mic0 than the time needed to reach microphone mic1. Analogously, the sound S1 needs less time to arrive at mic1 than the time it takes to arrive at mic0. The intensity of the signal S0, when reaching the microphone mic1, is in general attenuated with respect to when reaching mic0, and vice versa.
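As a rough quantitative illustration of these arrival-time differences (not taken from the patent text; a far-field sketch with an assumed sampling rate and speed of sound):

```python
import math

def tdoa_samples(mic_distance_m, angle_deg, fs=44100, c=343.0):
    """Time difference of arrival between two microphones, in samples,
    under a far-field (plane-wave) assumption: the wavefront reaches the
    nearer microphone earlier by mic_distance * sin(angle) / c seconds.
    The angle is measured from the broadside of the microphone pair."""
    delta_t = mic_distance_m * math.sin(math.radians(angle_deg)) / c
    return delta_t * fs

# For the 5 cm spacing of Fig. 1 the inter-microphone delay never exceeds
# a few samples at 44.1 kHz, which is why fractional delays matter:
print(tdoa_samples(0.05, 90.0))  # ≈ 6.43 samples (source on the microphone axis)
```

Because the true delay is generally a non-integer number of samples, fractional delay filters (discussed below) are needed to model it accurately.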

Accordingly, in the multi-channel input signal 502, the channel signals M0 and M1 are such that the signals S0 and S1 from the sound sources source0 and source1 are combinations of each other. Separation techniques are therefore pursued.

Summary

Here below, text in square brackets and round brackets indicates non-limiting examples.

In accordance to an aspect, there is provided an apparatus [e.g. a multichannel or stereo audio source separation apparatus] for obtaining a plurality of output signals [S'0, S'1] associated with different sound sources [source0, source1] on the basis of a plurality of input signals [e.g. microphone signals], in which signals from the sound sources [source0, source1] are combined, wherein the apparatus is configured to combine a first input signal [M0], or a processed [e.g. delayed and/or scaled] version thereof, with a delayed and scaled version [a1·z^(-d1)·M1] of a second input signal [e.g. M1] [e.g. by subtracting the delayed and scaled version of the second input signal from the first input signal, e.g. by S'0 = M0(z) - a1·z^(-d1)·M1(z)], to obtain a first output signal [S'0]; wherein the apparatus is configured to combine a second input signal [M1], or a processed [e.g. delayed and/or scaled] version thereof, with a delayed and scaled version [a0·z^(-d0)·M0] of the first input signal [M0] [e.g. by subtracting the delayed and scaled version of the first input signal from the second input signal, e.g. by S'1 = M1(z) - a0·z^(-d0)·M0(z)], to obtain a second output signal [S'1]; wherein the apparatus is configured to determine, using a random direction optimization [e.g. by performing one of the operations defined in other claims; and/or by finding the delay and attenuation values which minimize an objective function, which could be, for example, that in formulas (6) and/or (8)]: a first scaling value [a0], which is used to obtain the delayed and scaled version [a0·z^(-d0)·M0] of the first input signal [M0]; a first delay value [d0], which is used to obtain the delayed and scaled version [a0·z^(-d0)·M0] of the first input signal [M0]; a second scaling value [a1], which is used to obtain the delayed and scaled version [a1·z^(-d1)·M1] of the second input signal [M1]; and a second delay value [d1], which is used to obtain the delayed and scaled version of the second input signal [a1·z^(-d1)·M1].
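A minimal time-domain sketch of the combination step above, assuming a simple linear-interpolation fractional delay (the patent does not prescribe this particular delay implementation; all function names here are illustrative):

```python
import numpy as np

def delay_scale(x, a, d):
    """Return a * z^(-d) * x for a possibly fractional delay d, realized
    here by linear interpolation (an illustrative choice, not the
    patent's filter)."""
    n = np.arange(len(x))
    return a * np.interp(n - d, n, x, left=0.0, right=0.0)

def separate(m0, m1, a0, d0, a1, d1):
    """S'0 = M0 - a1*z^(-d1)*M1 and S'1 = M1 - a0*z^(-d0)*M0,
    evaluated directly in the time domain."""
    return m0 - delay_scale(m1, a1, d1), m1 - delay_scale(m0, a0, d0)
```

With the true attenuation/delay pair, the crosstalk term cancels; the random direction optimization described in the text searches for exactly these four values.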

The delayed and scaled version [a1·z^(-d1)·M1] of the second input signal [M1], which may be combined with the first input signal [M0], may be obtained by applying a fractional delay to the second input signal [M1].

The delayed and scaled version [a0·z^(-d0)·M0] of the first input signal [M0], which may be combined with the second input signal [M1], may be obtained by applying a fractional delay to the first input signal [M0].

The apparatus may sum a plurality of products [e.g., as in formula (6) or (8)] between: a respective element [Pi(n), with i being 0 or 1] of a first set of normalized magnitude values [e.g., as in formula (7)], and a logarithm of a quotient formed on the basis of: o the respective element [P(n) or Pi(n)] of the first set of normalized magnitude values; and o a respective element [Q(n) or Qi(n)] of a second set of normalized magnitude values, in order to obtain a value [DKL(P||Q) or D(P0,P1) in formulas (6) or (8)] describing a similarity [or dissimilarity] between a signal portion [s0'(n)] described by the first set of normalized magnitude values [P0(n), for n=1 to ...] and a signal portion [s1'(n)] described by the second set of normalized magnitude values [P1(n), for n=1 to ...].

The random direction optimization may be such that candidate parameters form a candidates' vector [e.g., with four entries, e.g. corresponding to a0, a1, d0, d1], wherein the vector is iteratively refined [e.g., in different iterations, see also claims 507ff.] by modifying the vector in random directions.

The random direction optimization may be such that candidate parameters form a candidates' vector [e.g., with four entries, e.g. corresponding to a0, a1, d0, d1], wherein the vector is iteratively refined [e.g., in different iterations, see also below] by modifying the vector in random directions.

The random direction optimization may be such that a metric and/or a value indicating the similarity (or dissimilarity) between the first and second output signals is measured, and the first and second output signals are selected to be those associated to the candidate parameters whose value or metric indicates the lowest similarity (or highest dissimilarity).

At least one of the first and second scaling values and first and second delay values may be obtained by minimizing the mutual information, or a related measure, of the output signals.

In accordance to an aspect, there is provided an apparatus for obtaining a plurality of output signals [S'0, S'1] associated with different sound sources [source0, source1] on the basis of a plurality of input signals [e.g. microphone signals][M0, M1], in which signals from the sound sources [source0, source1] are combined, wherein the apparatus is configured to combine a first input signal [M0], or a processed [e.g. delayed and/or scaled] version thereof, with a delayed and scaled version [a1·z^(-d1)·M1] of a second input signal [M1], to obtain a first output signal [S'0], wherein the apparatus is configured to apply a fractional delay [d1] to the second input signal [M1] [wherein the fractional delay (d1) may be indicative of the relationship and/or difference between the delay (e.g. delay represented by H1,0) of the signal (H1,0·S1) arriving at the first microphone (mic0) from the second source (source1) and the delay (e.g. delay represented by H1,1) of the signal (H1,1·S1) arriving at the second microphone (mic1) from the second source (source1)][in examples, the fractional delay d1 may be understood as approximating the exponent of the z term of the result of the fraction H1,0(z)/H1,1(z)]; wherein the apparatus is configured to combine a second input signal [M1], or a processed [e.g. delayed and/or scaled] version thereof, with a delayed and scaled version [a0·z^(-d0)·M0] of the first input signal [M0], to obtain a second output signal [S'1], wherein the apparatus is configured to apply a fractional delay [d0] to the first input signal [M0] [wherein the fractional delay (d0) may be indicative of the relationship and/or difference between the delay (e.g. delay represented by H0,0) of the signal (H0,0·S0) arriving at the first microphone (mic0) from the first source (source0) and the delay (e.g. delay represented by H0,1) of the signal (H0,1·S0) arriving at the second microphone (mic1) from the first source (source0)][in examples, the fractional delay d0 may be understood as approximating the exponent of the z term of the result of the fraction H0,1(z)/H0,0(z)]; wherein the apparatus is configured to determine, using an optimization: a first scaling value [a0], which is used to obtain the delayed and scaled version [a0·z^(-d0)·M0] of the first input signal [M0]; a first fractional delay value [d0], which is used to obtain the delayed and scaled version [a0·z^(-d0)·M0] of the first input signal [M0]; a second scaling value [a1], which is used to obtain the delayed and scaled version [a1·z^(-d1)·M1] of the second input signal [M1]; and a second fractional delay value [d1], which is used to obtain the delayed and scaled version [a1·z^(-d1)·M1] of the second input signal [M1].
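One common way to realize such a fractional delay z^(-d) is a windowed-sinc FIR filter. The following is a generic textbook construction, not a filter design taken from the text:

```python
import numpy as np

def frac_delay_fir(d, n_taps=33):
    """FIR taps approximating a pure delay z^(-d), 0 < d < n_taps - 1,
    built as a Hamming-windowed sinc (a generic construction; the text
    does not prescribe a specific design)."""
    n = np.arange(n_taps)
    h = np.sinc(n - d) * np.hamming(n_taps)
    return h / h.sum()  # normalize for unity gain at DC

def apply_frac_delay(x, d):
    """Filter x with the fractional-delay FIR and trim to the input length."""
    return np.convolve(x, frac_delay_fir(d))[: len(x)]
```

The approximation is best when d lies near the middle of the tap range; any integer part of the desired delay can be split off and applied as a plain sample shift.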

The optimization may be a random direction optimization.

The apparatus may sum a plurality of products [e.g., as in formula (6) or (8)] between: a respective element [Pi(n), with i being 0 or 1] of a first set of normalized magnitude values [e.g., as in formula (7)], and a logarithm of a quotient formed on the basis of: o the respective element [P(n) or Pi(n)] of the first set of normalized magnitude values; and o a respective element [Q(n) or Qi(n)] of a second set of normalized magnitude values, in order to obtain a value [DKL(P||Q) or D(P0,P1) in formulas (6) or (8)] describing a similarity [or dissimilarity] between a signal portion [s0'(n)] described by the first set of normalized magnitude values [P0(n), for n=1 to ...] and a signal portion [s1'(n)] described by the second set of normalized magnitude values [P1(n), for n=1 to ...].

In accordance to an aspect, there is provided an apparatus [e.g. a multichannel or stereo audio source separation apparatus] for obtaining a plurality of output signals [S'0, S'1] associated with different sound sources [source0, source1] on the basis of a plurality of input signals [e.g. microphone signals][M0, M1], in which signals from the sound sources are combined [e.g. by subtracting a delayed and scaled version of a second input signal from a first input signal and/or by subtracting a delayed and scaled version of a first input signal from a second input signal], wherein the apparatus is configured to combine a first input signal [M0], or a processed [e.g. delayed and/or scaled] version thereof, with a delayed and scaled version [a1·z^(-d1)·M1] of a second input signal [M1] [e.g. by subtracting the delayed and scaled version of the second input signal from the first input signal], to obtain a first output signal [S'0], wherein the apparatus is configured to combine a second input signal [M1], or a processed [e.g. delayed and/or scaled] version thereof, with a delayed and scaled version [a0·z^(-d0)·M0] of the first input signal [M0] [e.g. by subtracting the delayed and scaled version of the first input signal from the second input signal], to obtain a second output signal [S'1], wherein the apparatus is configured to sum a plurality of products [e.g., as in formula (6) or (8)] between: a respective element [Pi(n), with i being 0 or 1] of a first set of normalized magnitude values [e.g., as in formula (7)], and - a logarithm of a quotient formed on the basis of: o the respective element [P(n) or Pi(n)] of the first set of normalized magnitude values; and o a respective element [Q(n) or Qi(n)] of a second set of normalized magnitude values, in order to obtain a value [DKL(P||Q) or D(P0,P1) in formulas (6) or (8)] describing a similarity [or dissimilarity] between a signal portion [s0'(n)] described by the first set of normalized magnitude values [P0(n), for n=1 to ...] and a signal portion [s1'(n)] described by the second set of normalized magnitude values [P1(n), for n=1 to ...].

The apparatus may determine: a first scaling value [a0], which is used to obtain the delayed and scaled version of the first input signal [M0], a first delay value [d0], which is used to obtain the delayed and scaled version of the first input signal, a second scaling value [a1], which is used to obtain the delayed and scaled version of the second input signal, and a second delay value [d1], which is used to obtain the delayed and scaled version of the second input signal, using an optimization [e.g. on the basis of a "modified KLD computation"].

The first delay value [d0] may be a fractional delay. The second delay value [d1] may be a fractional delay.

The optimization may be a random direction optimization.

The apparatus may perform at least some of the processes in the time domain. The apparatus may perform at least some of the processes in the z transform or frequency domain.

The apparatus may be configured to: combine the first input signal [M0], or a processed [e.g. delayed and/or scaled] version thereof, with the delayed and scaled version [a1·z^(-d1)·M1] of the second input signal [M1] in the time domain and/or in the z transform or frequency domain; and combine the second input signal [M1], or a processed [e.g. delayed and/or scaled] version thereof, with the delayed and scaled version [a0·z^(-d0)·M0] of the first input signal [M0] in the time domain and/or in the z transform or frequency domain.

The optimization may be performed in the time domain and/or in the z transform or frequency domain.

The fractional delay (d0) applied to the first input signal [M0] may be indicative of the relationship and/or difference of arrival between: the signal [S0·H0,0(z)] from the first source [source0] received by the first microphone [mic0]; and the signal [S0·H0,1(z)] from the first source [source0] received by the second microphone [mic1].

The fractional delay (d1) applied to the second input signal [M1] may be indicative of the relationship and/or difference of arrival between: the signal [S1·H1,1(z)] from the second source [source1] received by the second microphone [mic1]; and the signal [S1·H1,0(z)] from the second source [source1] received by the first microphone [mic0].

The apparatus may perform an optimization such that different candidate parameters [a0, a1, d0, d1] are iteratively chosen and processed, and a metric [e.g., as in formula (6) or (8)][e.g. on the basis of a "modified KLD computation"][e.g., an objective function] is measured for each set of candidate parameters, wherein the metric is a similarity metric (or dissimilarity metric), so as to choose the first output signal [S'0] and the second output signal [S'1] obtained by using the candidate parameters [a0, a1, d0, d1] which are associated to the metric indicating the lowest similarity (or largest dissimilarity) [the similarity may be imagined as a statistical dependency between the first and second output signals (or values associated thereto, such as those in formula (7)), and/or the dissimilarity may be imagined as a statistical independency between the first and second output signals (or values associated thereto, such as those in formula (7))]. For each iteration, the candidate parameters may include a candidate delay (d0) [e.g., a candidate fractional delay] to be applied to the first input signal [M0], the candidate delay (d0) being associable to a candidate relationship and/or candidate difference of arrival between: the signal [S0·H0,0(z)] from the first source [source0] received by the first microphone [mic0]; and the signal [S0·H0,1(z)] from the first source [source0] received by the second microphone [mic1].

For each iteration, the candidate parameters may include a candidate delay (d1) [e.g., a candidate fractional delay] to be applied to the second input signal [M1], the candidate delay (d1) being associable to a candidate relationship and/or candidate difference of arrival between: the signal [S1·H1,1(z)] from the second source [source1] received by the second microphone [mic1]; and the signal [S1·H1,0(z)] from the second source [source1] received by the first microphone [mic0].

For each iteration, the candidate parameters may include a candidate relative attenuation value [a0] to be applied to the first input signal [M0], the candidate relative attenuation value [a0] being indicative of a candidate relationship and/or candidate difference between: the amplitude of the signal [S0·H0,0(z)] received by the first microphone [mic0] from the first source [source0]; and the amplitude of the signal [S0·H0,1(z)] received by the second microphone [mic1] from the first source [source0].

For each iteration, the candidate parameters may include a candidate relative attenuation value [a1] to be applied to the second input signal [M1], the candidate relative attenuation value [a1] being indicative of a candidate relationship and/or candidate difference between: the amplitude of the signal [S1·H1,1(z)] received by the second microphone [mic1] from the second source [source1]; and the amplitude of the signal [S1·H1,0(z)] received by the first microphone [mic0] from the second source [source1].

The apparatus may change at least one candidate parameter between different iterations by randomly choosing at least one step from at least one candidate parameter of a preceding iteration to at least one candidate parameter of a subsequent iteration [e.g., random direction optimization].

The apparatus may choose the at least one step [e.g., coeffvariation in line 10 of algorithm 1] randomly [e.g., random direction optimization].

The at least one step may be weighted by a preselected weight [e.g. coeffweights in line 5 of algorithm 1].

The at least one step may be limited by a preselected weight [e.g. coeffweights in line 5 of algorithm 1].

The apparatus may be so that the candidate parameters [a0, a1, d0, d1] form a candidates' vector, wherein, for each iteration, the candidates' vector is perturbed [e.g., randomly] by applying a vector of uniformly distributed random numbers [e.g., each between -0.5 and +0.5], which are element-wise multiplied by (or added to) the elements of the candidates' vector [it is possible to avoid gradient processing] [e.g., random direction optimization].

For each iteration, the candidates' vector may be modified (e.g., perturbed) by a step [e.g., each entry of which is between -0.5 and +0.5].

The apparatus may be so that the number of iterations is limited to a predetermined maximum number, the predetermined maximum number being between 10 and 30 (e.g., 20, as in subsection 2.3, last three lines).
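The iteration scheme described above (random perturbation in [-0.5, +0.5], element-wise weighting, keeping only improving candidates, a small iteration budget) can be sketched as follows; parameter names such as `weights` merely mirror the described coeffweights and are assumptions:

```python
import numpy as np

def random_direction_minimize(objective, x0, weights, n_iter=20, rng=None):
    """Random-direction search in the spirit of the described algorithm 1:
    perturb the current best candidate vector by uniform random steps in
    [-0.5, +0.5], element-wise weighted, and keep a perturbed vector only
    if it lowers the objective. The exact update rule here is an
    assumption based on the text."""
    rng = np.random.default_rng() if rng is None else rng
    best = np.asarray(x0, dtype=float)
    best_val = objective(best)
    for _ in range(n_iter):
        step = (rng.random(best.shape) - 0.5) * np.asarray(weights, dtype=float)
        candidate = best + step          # no gradient computation needed
        value = objective(candidate)
        if value < best_val:             # keep only improving candidates
            best, best_val = candidate, value
    return best, best_val
```

In the separation setting, `x0` would hold the four candidate parameters [a0, a1, d0, d1] and `objective` the (dis)similarity metric of formulas (6) or (8).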

The metric may be computed as a Kullback-Leibler divergence.

The metric may be based on: for each of the first and second signals [M0, M1], a respective element [Pi(n), with i being 0 or 1] of a first set of normalized magnitude values [e.g., as in formula (7)]. [A trick may be: considering the normalized magnitude values of the time domain samples as probability distributions, and after that measuring the metric (e.g., as the Kullback-Leibler divergence, e.g. as obtained through formula (6) or (8)).] For at least one of the first and second input signals [M0, M1], the respective element [Pi(n)] may be based on the candidate first or second output signal [S'0, S'1] as obtained from the candidate parameters [e.g., like in formula (7)].


For at least one of the first and second input signals [M0, M1], the respective element [Pi(n)] may be obtained as a fraction between: a value [e.g., absolute value] associated to a candidate first or second output signal [S'0(n), S'1(n)]; and a norm [e.g., 1-norm] associated to the previously obtained values of the first or second output signal [S'0(...n-1), S'1(...n-1)].

For at least one of the first and second input signals [M0, M1], the respective element [Pi(n)] may be obtained by

Pi(n) = |s'i(n)| / Σm |s'i(m)|

(Here, "s'i(n)" and "s'i" are written without capital letters by virtue of not being, in this case, z transforms.)

The metric may include a logarithm of a quotient formed on the basis of: o the respective element [P(n) or Pi(n)] of the first set of normalized magnitude values; and o a respective element [Q(n) or Qi(n)] of a second set of normalized magnitude values, in order to obtain a value [DKL(P||Q) or D(P0,P1) in formulas (6) or (8)] describing a similarity [or dissimilarity] between a signal portion [s0'(n)] described by the first set of normalized magnitude values [P0(n), for n=1 to ...] and a signal portion [s1'(n)] described by the second set of normalized magnitude values [P1(n), for n=1 to ...].

The metric may be obtained in the form of:

DKL(P||Q) = Σn P(n) · log( P(n) / Q(n) )

wherein P(n) is an element associated to the first input signal [e.g., P1(n) or an element of the first set of normalized magnitude values] and Q(n) is an element associated to the second input signal [e.g., P2(n) or an element of the second set of normalized magnitude values].

The metric may be obtained in the form of:

D(P1, P2) = Σn P1(n) · log( P1(n) / P2(n) )

wherein P1(n) is an element associated to the first input signal [e.g., an element of the first set of normalized magnitude values] and P2(n) is an element associated to the second input signal [e.g., an element of the second set of normalized magnitude values].
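Treating the normalized time-domain magnitudes as probability distributions and evaluating the Kullback-Leibler divergence, as in formulas (6)-(8), can be sketched as follows (the small epsilon guard against zero samples is an implementation assumption, not part of the text):

```python
import numpy as np

def normalized_magnitudes(s, eps=1e-12):
    """Formula (7)-style normalization: P(n) = |s(n)| / sum_m |s(m)|,
    so that the magnitudes of the time-domain samples form a
    probability distribution."""
    a = np.abs(np.asarray(s, dtype=float)) + eps
    return a / a.sum()

def kld(p, q):
    """Kullback-Leibler divergence D_KL(P||Q) = sum_n P(n)*log(P(n)/Q(n)),
    as in formula (6)."""
    return float(np.sum(p * np.log(p / q)))
```

The optimization then picks the candidate parameters for which such a value between the two separated outputs indicates the lowest statistical dependency.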

The apparatus may perform the optimization using a sliding window [e.g., the optimization may take into account time domain samples of the last 0.1 s ... 1.0 s].

The apparatus may transform, into a frequency domain, information associated to the obtained first and second output signals (S'0, S'1).

The apparatus may encode information associated to the obtained first and second output signals (S'0, S'1).

The apparatus may store information associated to the obtained first and second output signals (S'0, S'1).

The apparatus may transmit information associated to the obtained first and second output signals (S'0, S'1).

The apparatus of any of the preceding claims may include at least one of a first microphone (mic0) for obtaining the first input signal [M0] and a second microphone (mic1) for obtaining the second input signal [M1] [e.g., at a fixed distance]. An apparatus for teleconferencing may be provided, including the apparatus as above and equipment for transmitting information associated to the obtained first and second output signals (S'0, S'1).

A binaural system is disclosed including the apparatus as above.

An optimizer is disclosed for iteratively optimizing physical parameters associated to physical signals, wherein the optimizer is configured, at each iteration, to randomly generate a current candidate vector for evaluating whether the current candidate vector performs better than a current best candidate vector, wherein the optimizer is configured to evaluate an objective function associated to a similarity, or dissimilarity, between physical signals, in association to the current candidate vector, wherein the optimizer is configured so that, in case the current candidate vector causes the objective function to be reduced with respect to the current best candidate vector, to render, as the new current best candidate vector, the current candidate vector.

The physical signal may include audio signals obtained by different microphones.

The parameters may include a delay and/or a scaling factor for an audio signal obtained at a particular microphone.

The objective function may be a Kullback-Leibler divergence. The Kullback-Leibler divergence may be applied to first and second sets of normalized magnitude values.

The objective function may be obtained by summing a plurality of products [e.g., as in formula (6) or (8)] between:

- a respective element [Pi(n), with i being 0 or 1] of a first set of normalized magnitude values [e.g., as in formula (7)], and a logarithm of a quotient formed on the basis of: o the respective element [P(n) or Pi(n)] of the first set of normalized magnitude values; and o a respective element [Q(n) or Qi(n)] of a second set of normalized magnitude values, in order to obtain a value [DKL(P||Q) or D(P0,P1) in formulas (6) or (8)] describing a similarity [or dissimilarity] between a signal portion [s0'(n)] described by the first set of normalized magnitude values [P0(n), for n=1 to ...] and a signal portion [s1'(n)] described by the second set of normalized magnitude values [P1(n), for n=1 to ...].

The objective function may be obtained as

D(P||Q) = Σn P(n) · log( P(n) / Q(n) )

wherein P1(n) or P(n) is an element associated to the first input signal [e.g., an element of the first set of normalized magnitude values] and P2(n) or Q(n) is an element associated to the second input signal.

In accordance to an example, there is provided a method for obtaining a plurality of output signals [S'0, S'1] associated with different sound sources [source0, source1] on the basis of a plurality of input signals [e.g. microphone signals][M0, M1], in which signals from the sound sources [source0, source1] are combined, the method comprising: combining a first input signal [M0], or a processed [e.g. delayed and/or scaled] version thereof, with a delayed and scaled version [a1·z^(-d1)·M1] of a second input signal [M1] [e.g. by subtracting the delayed and scaled version of the second input signal from the first input signal, e.g. by S'0 = M0(z) - a1·z^(-d1)·M1(z)], to obtain a first output signal [S'0]; combining a second input signal [M1], or a processed [e.g. delayed and/or scaled] version thereof, with a delayed and scaled version [a0·z^(-d0)·M0] of the first input signal [M0] [e.g. by subtracting the delayed and scaled version of the first input signal from the second input signal, e.g. by S'1 = M1(z) - a0·z^(-d0)·M0(z)], to obtain a second output signal [S'1]; and determining, using a random direction optimization [e.g. by performing one of the operations defined in other claims; and/or by finding the delay and attenuation values which minimize an objective function, which could be, for example, that in formulas (6) and/or (8)]: a first scaling value [a0], which is used to obtain the delayed and scaled version [a0·z^(-d0)·M0] of the first input signal [M0]; a first delay value [d0], which is used to obtain the delayed and scaled version [a0·z^(-d0)·M0] of the first input signal [M0]; a second scaling value [a1], which is used to obtain the delayed and scaled version [a1·z^(-d1)·M1] of the second input signal [M1]; and a second delay value [d1], which is used to obtain the delayed and scaled version of the second input signal [a1·z^(-d1)·M1].

In accordance to an example, there is provided a method for obtaining a plurality of output signals [S'0, S'1] associated with different sound sources [source0, source1] on the basis of a plurality of input signals [e.g. microphone signals][M0, M1], in which signals from the sound sources [source0, source1] are combined, the method including: combining a first input signal [M0], or a processed [e.g. delayed and/or scaled] version thereof, with a delayed and scaled version [a1·z^(-d1)·M1] of a second input signal [M1], to obtain a first output signal [S'0], wherein the method applies a fractional delay [d1] to the second input signal [M1] [wherein the fractional delay (d1) may be indicative of the relationship and/or difference between the delay (e.g. delay represented by H1,0) of the signal (H1,0·S1) arriving at the first microphone (mic0) from the second source (source1) and the delay (e.g. delay represented by H1,1) of the signal (H1,1·S1) arriving at the second microphone (mic1) from the second source (source1)][in examples, the fractional delay d1 may be understood as approximating the exponent of the z term of the result of the fraction H1,0(z)/H1,1(z)]; combining a second input signal [M1], or a processed [e.g. delayed and/or scaled] version thereof, with a delayed and scaled version [a0·z^(-d0)·M0] of the first input signal [M0], to obtain a second output signal [S'1], wherein the method applies a fractional delay [d0] to the first input signal [M0] [wherein the fractional delay (d0) may be indicative of the relationship and/or difference between the delay (e.g. delay represented by H0,0) of the signal (H0,0·S0) arriving at the first microphone (mic0) from the first source (source0) and the delay (e.g. delay represented by H0,1) of the signal (H0,1·S0) arriving at the second microphone (mic1) from the first source (source0)][in examples, the fractional delay d0 may be understood as approximating the exponent of the z term of the result of the fraction H0,1(z)/H0,0(z)]; and determining, using an optimization: a first scaling value [a0], which is used to obtain the delayed and scaled version [a0·z^(-d0)·M0] of the first input signal [M0]; a first fractional delay value [d0], which is used to obtain the delayed and scaled version [a0·z^(-d0)·M0] of the first input signal [M0]; a second scaling value [a1], which is used to obtain the delayed and scaled version [a1·z^(-d1)·M1] of the second input signal [M1]; and a second fractional delay value [d1], which is used to obtain the delayed and scaled version [a1·z^(-d1)·M1] of the second input signal [M1].

In accordance to an example, there is provided a method for obtaining a plurality of output signals [S'0, S'1] associated with different sound sources [source0, source1] on the basis of a plurality of input signals [e.g. microphone signals][M0, M1], in which signals from the sound sources are combined [e.g. by subtracting a delayed and scaled version of a second input signal from a first input signal and/or by subtracting a delayed and scaled version of a first input signal from a second input signal], the method comprising: combining a first input signal [M0], or a processed [e.g. delayed and/or scaled] version thereof, with a delayed and scaled version [a1·z^(-d1)·M1] of a second input signal [M1] [e.g. by subtracting the delayed and scaled version of the second input signal from the first input signal], to obtain a first output signal [S'0]; combining a second input signal [M1], or a processed [e.g. delayed and/or scaled] version thereof, with a delayed and scaled version [a0·z^(-d0)·M0] of the first input signal [M0] [e.g. by subtracting the delayed and scaled version of the first input signal from the second input signal], to obtain a second output signal [S'1]; and summing a plurality of products [e.g., as in formula (6) or (8)] between: a respective element [Pi(n), with i being 0 or 1] of a first set of normalized magnitude values [e.g., as in formula (7)], and a logarithm of a quotient formed on the basis of: o the respective element [P(n) or Pi(n)] of the first set of normalized magnitude values; and o a respective element [Q(n) or Qi(n)] of a second set of normalized magnitude values, in order to obtain a value [DKL(P||Q) or D(P0,P1) in formulas (6) or (8)] describing a similarity [or dissimilarity] between a signal portion [s0'(n)] described by the first set of normalized magnitude values [P0(n), for n=1 to ...] and a signal portion [s1'(n)] described by the second set of normalized magnitude values [P1(n), for n=1 to ...].

In accordance with an example, a method of any of the preceding method claims may be performed using equipment as described above or below. There is also provided a non-transitory storage unit storing instructions which, when executed by a processor, cause the processor to perform a method according to any of the preceding method claims.

Brief description of the figures

Fig. 1 shows a layout of microphones and sources useful to understand the invention;

Fig. 2a shows a functioning technique according to the present invention;

Fig. 2b shows a signal block diagram of the convolutive mixing and demixing process;

Fig. 3 shows performance evaluation of BSS algorithm applied to simulated data;

Fig. 4 shows a layout of microphones and sound sources useful to understand the invention;

Fig. 5 shows an apparatus according to the invention;

Figs. 6a, 6b and 6c show results obtainable with the invention; and

Fig. 7 shows elements of the apparatus of Fig. 5.

Description of examples

It has been understood that, by applying techniques such as those discussed above and below, a signal may be processed so as to arrive at a plurality of signals S'1 and S'0 separated from each other. The result is that the output signal S'1 is not affected (or negligibly or minimally affected) by the sound S0, while the output signal S'0 is not affected (or minimally or negligibly affected) by the effects of the sound S1 onto the microphone mic0.

An example is provided by Fig. 2b, showing a model of the physical relationships between the generated sounds S0 and S1 and the signals 502 as collectively obtained from the microphones mic0 and mic1. The relationships are here represented in the z transform domain (which in some cases is not indicated for the sake of brevity). As can be seen from block 501, the sound signal S0 is subjected to a transfer function H0,0(z), whose result is summed to the sound signal S1 (modified by a transfer function H1,0(z)). Accordingly, the signal M0(z) obtained at microphone mic0 unwantedly takes into account the sound signal S1(z). Analogously, the signal M1(z) as obtained at the microphone mic1 includes both a component associated with the sound signal S1(z) (obtained through a transfer function H1,1(z)) and a second, unwanted component caused by the sound signal S0(z) (after having been subjected to the transfer function H0,1(z)). This phenomenon is called crosstalk.

In order to compensate for the crosstalk, the solution indicated at block 510 may be exploited. Here, the multi-channel output signal 504 includes:

• a first output signal S'0(z) (representing the sound S0 collected at microphone mic0 but cleaned of the crosstalk), which includes at least the two components: o the input signal M0, and o a subtractive component 503_1 (which is a delayed and scaled version of the signal M1 and which may be obtained by subjecting the signal M1 to the transfer function -a1·z^-d1);

• a second output signal S'1(z) (representing the sound S1 collected at microphone mic1 but cleaned of the crosstalk), which includes: o the input signal M1, and o a subtractive component 503_0 (which is a delayed and scaled version of the first input signal M0 as obtained at the microphone mic0 and which may be obtained by subjecting the signal M0 to the transfer function -a0·z^-d0).

The mathematical explanations are provided below, but it may be understood that the subtractive components 503_1 and 503_0 at block 510 compensate for the unwanted components caused at block 501. It is therefore clear that block 510 permits to obtain a plurality (504) of output signals (S'0, S'1), associated with different sound sources (source0, source1), on the basis of a plurality (502) of input signals [e.g. microphone signals] (M0, M1), in which signals (S0, S1) from the sound sources (source0, source1) are (unwantedly) combined (501). The block 510 may be configured to combine (510) the first input signal (M0), or a processed [e.g. delayed and/or scaled] version thereof, with a delayed and scaled version (503_1) [a1·z^-d1·M1] of the second input signal (M1) [e.g. by subtracting the delayed and scaled version of the second input signal from the first input signal, e.g. by S'0(z) = M0(z) - a1·z^-d1·M1(z)], to obtain a first output signal (S'0); and to combine (510) the second input signal (M1), or a processed [e.g. delayed and/or scaled] version thereof, with a delayed and scaled version (503_0) [a0·z^-d0·M0] of the first input signal [M0] [e.g. by subtracting the delayed and scaled version of the first input signal from the second input signal, e.g. by S'1(z) = M1(z) - a0·z^-d0·M0(z)], to obtain a second output signal [S'1].
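The crosstalk compensation of block 510 can be sketched as time-domain code. This is a minimal illustration, not the implementation of the present examples: it uses integer delays (the present examples use fractional delays, discussed later), and the function name demix is hypothetical.

```python
import numpy as np

def demix(m0, m1, a0, d0, a1, d1):
    """Crosstalk compensation (block 510): S'0(z) = M0(z) - a1*z^-d1*M1(z)
    and S'1(z) = M1(z) - a0*z^-d0*M0(z), here with integer delays d0, d1."""
    def delayed(x, d):
        # delay x by d samples, zero-padding at the start
        return np.concatenate((np.zeros(d), x[:len(x) - d])) if d > 0 else x
    s0 = m0 - a1 * delayed(m1, d1)   # subtractive component 503_1
    s1 = m1 - a0 * delayed(m0, d0)   # subtractive component 503_0
    return s0, s1
```

For instance, if microphone signal M0 contains only a delayed, attenuated copy of the second source, choosing a1 and d1 to match that copy cancels the crosstalk in S'0 entirely.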

While the z transform is particularly useful in this case, it is notwithstanding possible to make use of other kinds of transforms or to directly operate in the time domain.

Basically, it may be understood that a couple of scaling values a0 and a1 modify the amplitude of the subtractive components 503_1 and 503_0 to obtain scaled versions of the input signals, and the delays d0 and d1 may be understood as fractional delays. The fractional delay d1 may be indicative of the relationship and/or difference between the delay (e.g. the delay represented by H1,0) of the signal (H1,0·S1) arriving at the first microphone (mic0) from the second source (source1) and the delay (e.g. the delay represented by H1,1) of the signal (H1,1·S1) arriving at the second microphone (mic1) from the second source (source1). In examples, the fractional delay d1 may be understood as approximating the exponent of the z term of the result of the fraction H1,0(z)/H1,1(z). The fractional delay d0 may be indicative of the relationship and/or difference between the delay (e.g. the delay represented by H0,0) of the signal (H0,0·S0) arriving at the first microphone (mic0) from the first source (source0) and the delay (e.g. the delay represented by H0,1) of the signal (H0,1·S0) arriving at the second microphone (mic1) from the first source (source0). In examples, the fractional delay d0 may be understood as approximating the exponent of the z term of the result of the fraction H0,1(z)/H0,0(z).

As it will be explained subsequently, it is possible to find the most preferable values (also collectively indicated with the reference numeral 564), in particular:

• a first scaling value [a0], e.g., which is used to obtain the delayed and scaled version 503_0 [a0·z^-d0·M0] of the first input signal [502, M0];

• a first fractional delay value [d0], e.g., which is used to obtain the delayed and scaled version 503_0 [a0·z^-d0·M0] of the first input signal [502, M0];

• a second scaling value [a1], e.g., which is used to obtain the delayed and scaled version 503_1 [a1·z^-d1·M1] of the second input signal [502, M1]; and

• a second fractional delay value [d1], e.g., which is used to obtain the delayed and scaled version 503_1 [a1·z^-d1·M1] of the second input signal [502, M1].

Techniques for obtaining the most preferable scaling values a0 and a1 and delay values d0 and d1 are here discussed, particularly with reference to Fig. 5. As can be seen from Fig. 5, a stereo or multiple-channel signal 502 (including the input signals M0(z) and M1(z)) is obtained. The method may be iterative, in the sense that it is possible to cycle along multiple iterations for obtaining the best values of the scaling values and the delay values to be adopted. Fig. 5 shows an output 504 formed by signals S'0(z) and S'1(z) which are optimized, e.g. after multiple iterations. Fig. 5 also shows the demixing block 510, which may be the block 510 of Fig. 2b.

The multi-channel signal 512 (including its channel components, i.e. the separated signals S'0(z) and S'1(z)) is thus obtained by making use of scaling values a0 and a1 and delay values d0 and d1, which become more and more optimized along the iterations.

At block 520, normalizations are performed on the signals S'0(z) and S'1(z). An example of normalization is provided by formula (7), represented as the following quotient:

P_i(n) = |s'_i(n)| / ||s'_i||_1    (7)

Here, i = 0, 1, indicating that there is a normalized value P0(n) for the first channel and a normalized value P1(n) for the second channel. The index n is the time index of the time domain signal. Here, s'_i(n) is the time domain sample (it is not a z transform) of the separated signal with index i (i = 0, 1). |s'_i(n)| indicates the magnitude (e.g. absolute value) of s'_i(n), which is therefore positive or, at worst, 0. This implies that the numerator in formula (7) is positive or, at worst, 0. The denominator in formula (7) is formed by the 1-norm of the vector s'_i. The 1-norm ||s'_i||_1 indicates the sum of the magnitudes |s'_i(n)|, where n goes over the signal samples, e.g. up to the present index (e.g., the signal samples may be taken within a predetermined window from a past index to the present index). Hence, ||s'_i||_1 (which is the denominator in formula (7)) is positive (or is 0 in some cases). Moreover, it always holds that |s'_i(n)| ≤ ||s'_i||_1, which implies that 0 ≤ P_i(n) ≤ 1 (i = 0, 1). Further, the following is also verified:

It has therefore been noted that P0(n) and P1(n) can be artificially considered as probabilities since, by adopting equation (7), they verify:

1. P_i(n) ≥ 0, ∀n

2. Σ_{n=0}^∞ P_i(n) = 1, with i = 0, 1

(further discussion is provided here below). The symbol "∞" is used for mathematical formalism, but can be approximated over the considered signal.

It is noted that other kinds of normalizations may be provided, and not only those obtained through formula (7).
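The normalization of formula (7) can be sketched in a few lines; the helper name normalize_magnitudes is hypothetical, introduced only for illustration:

```python
import numpy as np

def normalize_magnitudes(s):
    """Formula (7): P(n) = |s(n)| / ||s||_1, so that P(n) >= 0 and sum_n P(n) = 1."""
    mags = np.abs(s)
    return mags / np.sum(mags)   # divide by the 1-norm of the signal
```

The result behaves like a discrete probability distribution, which is exactly what the Kullback-Leibler machinery below requires.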

Fig. 5 shows block 530, which receives the normalized values 522 and outputs a similarity value (or dissimilarity value) 532. Block 530 may be understood as a block which computes a metric giving an indication of how similar (or dissimilar) the separated channels are to each other.

It has been understood that the metric chosen for indicating the similarity or dissimilarity between the two channels may be the so-called Kullback-Leibler Divergence (KLD). This can be obtained using formulas (6) or (8):

D_KL(P||Q) = Σ_n P(n) · log(P(n)/Q(n))    (6)

A discussion on how to obtain the Kullback-Leibler Divergence (KLD) is now provided. Fig. 7 shows an example of block 530 downstream of block 520 of Fig. 5. Block 520 therefore provides P0(n) and P1(n) (522), e.g. using formula (7) as discussed above (other techniques may be used). Block 530 (which may be understood as a Kullback-Leibler processor or KL processor) may be adapted to obtain a metric 532, which is in this case the Kullback-Leibler Divergence as calculated in formula (8).

With reference to Fig. 7, at a first branch 700a, a quotient 702' between P0(n) and P1(n) is calculated at block 702. At block 706, a logarithm of the quotient 702' is calculated, hence obtaining the value 706'. Then, the logarithm value 706' may be used for scaling the normalized value P0 at scaling block 710, hence obtaining a product 710'. At a second branch 700b, a quotient 704' (between P1(n) and P0(n)) is calculated at block 704. The logarithm 708' of the quotient 704' is calculated at block 708. Then, the logarithm value 708' is used for scaling the normalized value P1 at scaling block 712, hence obtaining the product 712'.

At adder block 714, the values 710' and 712' (as respectively obtained at branches 700a and 700b) are combined with each other. The combined values 714' are then summed along the sample domain indexes at block 716. The added value 716' may be inverted at block 718 (e.g., scaled by -1) to obtain the inverted value 718'. It is to be noted that, while the value 716' can be understood as a similarity value, the inverted value 718' can be understood as a dissimilarity value. Either the value 716' or the value 718' may be provided as metric 532 to the optimizer 560 as explained above (value 716' indicating similarity, value 718' indicating dissimilarity).

Hence, the block 530 may permit to arrive at formula (8), i.e. D(P0, P1) = -(Σ_n P0(n)·log(P0(n)/P1(n)) + Σ_n P1(n)·log(P1(n)/P0(n))). In order to arrive at formula (6), e.g. D_KL, it could simply be possible to eliminate, from Fig. 7, blocks 704, 708, 712 and 714, and substitute P0 with P and P1 with Q.

The Kullback-Leibler divergence was natively conceived for giving measurements regarding probabilities and is, in principle, unrelated to the physical significance of the input signals M0 and M1. Notwithstanding, the inventors have understood that, by normalizing the signals S'0 and S'1 and obtaining normalized values such as P0(n) and P1(n), the Kullback-Leibler Divergence provides a valid metric for measuring the similarity/dissimilarity between the channels. Hence, it is possible to consider the normalized magnitude values of the time domain samples as probability distributions and, after that, to measure the metric (e.g., the Kullback-Leibler divergence, e.g. as obtained through formula (6) or (8)).
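The metric of formulas (6) and (8) can be sketched as follows; the small epsilon guarding against division by zero is an assumption for numerical robustness, not part of the original formulas:

```python
import numpy as np

def kld(p, q, eps=1e-12):
    """Formula (6): D_KL(P||Q) = sum_n P(n) * log(P(n)/Q(n))."""
    p = np.asarray(p) + eps   # eps avoids log(0) and division by zero
    q = np.asarray(q) + eps
    return np.sum(p * np.log(p / q))

def objective(p0, p1):
    """Formula (8): the negative symmetric divergence
    D(P0, P1) = -(D_KL(P0||P1) + D_KL(P1||P0)).
    Minimizing it maximizes the dissimilarity of the separated channels."""
    return -(kld(p0, p1) + kld(p1, p0))
```

Identical distributions give an objective of 0; the more the two distributions differ, the more negative the objective becomes, which is why the optimizer minimizes it.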

Reference is now made to Fig. 5 again. For each iteration, the metric 532 provides a good estimate of the validity of the scaling values a0 and a1 and the delay values d0 and d1. Along the iterations, the scaling values a0 and a1 and the delay values d0 and d1 will be chosen among the different candidate values as those which present the lowest similarity or highest dissimilarity.

Block 560 (optimizer) receives the metric 532 and outputs candidates 564 (a vector) for the delay values d0 and d1 and the scaling values a0 and a1. The optimizer 560 may measure the different metrics obtained for different groups of candidates a0, a1, d0, d1, change them, and choose the group of candidates associated with the lowest similarity (or highest dissimilarity) 532. Hence, the output 504 (output signals S'0(z) and S'1(z)) will provide the best approximation. The candidate values 564 may be grouped in a vector, which can be subsequently modified, for example, through a random technique (Fig. 5 shows a random generator 540 providing a random input 542 to the optimizer 560). The optimizer 560 may make use of weights through which the candidate values 564 (a0, a1, d0, d1) are scaled (e.g., randomly). Initial coefficient weights 562 may be provided, e.g., by default. An example of the processing of the optimizer 560 is provided and discussed in detail below ("Algorithm 1"). Possible correspondences between the lines of the algorithm and elements of Fig. 5 are also shown in Fig. 5.

As may be seen, the optimizer 560 outputs a vector 564 of values a0, a1, d0, d1, which are subsequently reused at the demixing block 510 for obtaining new values 512, new normalized values 522, and new metrics 532. The procedure may stop after a certain number of iterations (which could be, for example, predefined); a maximum number of iterations may be, for example, a number chosen between 10 and 20. Basically, the optimizer 560 may be understood as finding the delay and scaling values which minimize an objective function, which could be, for example, the metric 532 obtained at block 530 and/or using formulas (6) and (8).

It has therefore been understood that the optimizer 560 may be based on a random direction optimization technique, such that candidate parameters form a candidates' vector [e.g., with four entries, e.g. corresponding to 564: a0, a1, d0, d1], wherein the candidates' vector is iteratively refined by modifying the candidates' vector in random directions.

Basically, the candidates' vector (indicating the subsequent values of a0, a1, d0, d1) may be iteratively refined by modifying candidate vectors in random directions. For example, following the random input 542, different candidate values may be modified by using different weights that vary randomly. Random directions may mean, for example, that while some candidate values are increased, other candidate values are decreased, or vice versa, without a predefined rule. The increments of the weights may also be random, even though a maximum threshold may be predefined.

The optimizer 560 may be such that candidate parameters [a0, a1, d0, d1] form a candidates' vector, wherein, for each iteration, the candidates' vector is perturbed [e.g., randomly] by applying a vector of uniformly distributed random numbers [e.g., each between -0.5 and +0.5], which are element-wise multiplied by (or added to) the elements of the candidates' vector. It is therefore possible to avoid gradient processing, e.g., by using random direction optimization. Hence, by randomly perturbing the vector of coefficients, it is possible to arrive, step by step, at the preferable values of a0, a1, d0, d1 and at the output signal 504 in which the combined sounds S0 and S1 are appropriately compensated. The algorithm is discussed below with a detailed description.

In the present examples, reference is always made to a multi-channel input signal 502 formed by two input channels (e.g., M0, M1). However, the same examples also apply to more than two channels.

In the examples, the logarithms may be in any base. It may be imagined that the base discussed above is 10.

Detailed discussion of the technique

One goal is a system for teleconferencing, for the separation of two speakers, or a speaker and a musical instrument or noise source, in a small office environment, not too far from a stereo microphone, as in available stereo webcams. The speakers or sources are assumed to be on opposing (left-right) sides of the stereo microphone. To be useful in real time teleconferencing, we want it to work online with as low delay as possible. For comparison, in this paper we focus on an offline implementation. The proposed approach works in the time domain, using attenuation factors and fractional delays between microphone signals to minimize cross-talk, the principle of a fractional delay and sum beamformer. Compared to other approaches this has the advantage that we have a lower number of variables to optimize, and we don't have the permutation problem of ICA-like approaches in the frequency domain. To optimize the separation, we minimize the negative Kullback-Leibler derived objective function between the resulting separated signals. For the optimization we use a novel algorithm of "random directions", without the need for gradients, which is very fast and robust. We evaluate our approach on convolutive mixtures generated from speech signals taken from the TIMIT data-set using a room impulse response simulator, and with real-life recordings. The results show that for the proposed scenarios our approach is competitive with regard to its separation performance, with a lower computational complexity and system delay than the prior-art approaches. Index Terms — Blind source separation, time domain, binaural room impulse responses, optimization

1. INTRODUCTION, PREVIOUS APPROACHES

Our system is for applications where we have two microphones and want to separate two audio sources. This could be for instance a teleconferencing scenario with a stereo webcam in an office and two speakers around it, or for hearing aids, where low computational complexity is important.

Previous approaches: An early previous approach is Independent Components Analysis (ICA). It can unmix a mix of signals with no delay in the mixture. It finds the coefficients of the unmixing matrix by maximizing non-gaussianity or maximizing the Kullback-Leibler Divergence [1, 2]. But for audio signals and a stereo microphone pair we always have propagation delay, in general convolutions with the room impulse responses [3], in the mix. Approaches to deal with it often apply the Short Time Fourier Transform (STFT) to the signals [4], e.g., AuxIVA [5] and ILRMA [6, 7]. This converts the signal delay into complex-valued factors in the STFT subbands, and a (complex-valued) ICA can be applied in the resulting subbands (e.g. [8]).

Problem: A problem that occurs here is a permutation in the subbands: the separated sources can appear in different orders in different subbands, and the gain for different sources in different subbands might be different, leading to a modified spectral shape, a spectral flattening. Also, we have a signal delay resulting from applying an STFT: it needs the assembly of the signal into blocks, which requires a system delay corresponding to the block size [9, 10].

Time domain approaches, like TRINICON [11], or approaches that use the STFT with short blocks and more microphones [12, 13], have the advantage that they don’t have a large blocking delay of the STFT, but they usually have a higher computational complexity, which makes them hard to use on small devices.

See also Fig. 1, showing a setup of loudspeakers and microphones in the simulation.

2. PROPOSED APPROACH

To avoid the processing delays associated with frequency domain approaches, we use a time domain approach. Instead of using FIR filters, we use IIR filters, which are implemented as fractional delay allpass filters [14, 15], with an attenuation factor: the principle of a fractional delay and sum or adaptive beamformer [16, 17, 18]. This has the advantage that each such filter has only 2 coefficients, the fractional delay and the attenuation. For the 2-channel stereo case this leads to a total of only 4 coefficients, which are then easier to optimize. For simplicity, we don't do a dereverberation either; we focus on the crosstalk minimization. In effect we model the Relative Transfer Function between the two microphones by an attenuation and a pure fractional delay. We then apply a novel optimization of "random directions", similar to the "Differential Evolution" method.

We assume a mixture recording from 2 sound sources (S0 and S1) made with 2 microphones (M0 and M1). However, the same results are also valid for more than two sources. The sound sources may be assumed to be in fixed positions as shown in Fig. 1. In order to avoid the need for modeling of non-causal impulse responses, the sound sources have to be in different half-planes of the microphone pair (left-right).

Instead of the commonly used STFT, we may use the z-transform for the mathematical derivation, because it does not need the decomposition of the signal into blocks, with its associated delay. This makes it suitable for a time domain implementation with no algorithmic delay. Remember that the (1-sided) z-transform of a time domain signal x(n), with sample index n, is defined as X(z) = Σ_{n=0}^∞ x(n)·z^-n. We use capital letters to denote z-transform domain signals.

Let us define s0(n) and s1(n) as our two time domain sound signals at the time instant (sample index) n, and their z-transforms as S0(z) and S1(z). The two microphone signals (collectively indicated with 502) are m0(n) and m1(n), and their z-transforms are M0(z) and M1(z) (Figure 2). The Room Impulse Responses (RIRs) from the i-th source to the j-th microphone are h_{i,j}(n), and their z-transforms H_{i,j}(z). Thus, our convolutive mixing system can be described in the z-domain as

M0(z) = H_{0,0}(z)·S0(z) + H_{1,0}(z)·S1(z)
M1(z) = H_{0,1}(z)·S0(z) + H_{1,1}(z)·S1(z)    (1)

In simplified matrix multiplication notation we can rewrite Equation (1) as

M(z) = H(z) · S(z)    (2)

For an ideal sound source separation we would need to invert the mixing matrix H(z). Hence, our sound sources could be calculated as

S(z) = H^-1(z) · M(z)    (3)

Since det(H(z)) and the diagonal elements of the inverse matrix are linear filters which do not contribute to the unmixing, we can neglect them for the separation and bring them to the left side of eq. (3). This results in

S'0(z) = M0(z) - (H_{1,0}(z)/H_{1,1}(z))·M1(z)
S'1(z) = M1(z) - (H_{0,1}(z)/H_{0,0}(z))·M0(z)    (4)

where H_{1,0}(z)/H_{1,1}(z) and H_{0,1}(z)/H_{0,0}(z) are now relative room transfer functions. Next we approximate these relative room transfer functions by fractional delays of d_i samples and attenuation factors a_i

H_{i,j}(z)/H_{i,i}(z) ≈ a_i · z^-d_i    (5)

where i, j ∈ {0, 1} and i ≠ j.

This approximation works particularly well when reverberation or echo is not too strong. For the fractional delays of d_i samples we use the fractional delay filter of the next section (2.1). Note that for simplicity we keep the linear filter resulting from the determinant and from the matrix diagonal H_{i,i}(z) on the left-hand side, meaning there is no dereverberation.

An example is provided here with reference to Fig. 1: assume two sources in a free field, without any reflections, symmetrically on opposing sides around a stereo microphone pair. The distance (in samples) of source0 to the near microphone shall be m_0 = 50 samples, to the far microphone m_1 = 55 samples. The sound amplitude shall decay according to the function k/m_i^2 with some constant k. Then the room transfer functions in the z-domain are H_{0,0}(z) = k/50^2 · z^-50 and H_{0,1}(z) = k/55^2 · z^-55, so the relative room transfer function is H_{0,1}(z)/H_{0,0}(z) = (50/55)^2 · z^-5 ≈ 0.826 · z^-5. We see that in this simple case the relative room transfer function is indeed exactly an attenuation and a delay. The signal flowchart of the convolutive mixing and demixing process can be seen in Fig. 2b (signal block diagram of the convolutive mixing and demixing process).
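This free-field example can be checked numerically. The sketch below represents each room transfer function as a (gain, delay) pair and confirms that the constant k cancels in the ratio:

```python
# Free-field check: H_00(z) = k/50^2 * z^-50, H_01(z) = k/55^2 * z^-55
k = 1.0                        # k cancels in the ratio, any value works
g00, d00 = k / 50**2, 50       # (gain, delay) of H_00
g01, d01 = k / 55**2, 55       # (gain, delay) of H_01
# Relative room transfer function H_01(z)/H_00(z) = a * z^-d
a = g01 / g00                  # attenuation factor (50/55)^2
d = d01 - d00                  # delay in samples
print(round(a, 3), d)          # → 0.826 5
```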

2.1. The fractional delay allpass filter

The fractional delay allpass filter for implementing the delay z^-d_i of eq. (5) plays an important role in our scheme, because it produces an IIR filter out of just a single coefficient, and allows for the implementation of a precise fractional delay (where the precise delay is not an integer value), needed for good cross-talk cancellation. The IIR property also allows for an efficient implementation. [14, 15] describe a practical method for designing fractional delay allpass filters, based on IIR filters with maximally flat group delay [19]. We use the following equations to obtain the coefficients for our fractional delay allpass filter, for a fractional delay τ = d_i. Its transfer function in the z-domain is A(z), with [14]

A(z) = z^-L · D(z^-1) / D(z)

where D(z) is of order L = ⌈τ⌉, with coefficients d(n). The filter d(n) is generated recursively as: d(0) = 1, d(n+1) = d(n) · ((L - n)(L - n - τ)) / ((n + 1)(n + 1 + τ)), for 0 ≤ n ≤ L - 1.
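A sketch of this fractional delay allpass filter, following the maximally flat group delay design referenced above; the exact recursion and the order choice L = int(τ) + 1 are assumptions consistent with [14, 15], not a verbatim transcription of those references:

```python
import numpy as np
from scipy.signal import lfilter

def allpass_delay(x, tau):
    """Apply an IIR allpass filter approximating a fractional delay of tau
    samples: A(z) = z^-L * D(1/z) / D(z), with D(z) of order L = int(tau)+1."""
    L = int(tau) + 1
    n = np.arange(L)
    # d(0) = 1; d(n+1) = d(n) * (L-n)(L-n-tau) / ((n+1)(n+1+tau))
    d = np.concatenate(([1.0], np.cumprod((L - n) * (L - n - tau)
                                          / ((n + 1) * (n + 1 + tau)))))
    return lfilter(np.flipud(d), d, x)  # numerator = reversed denominator
```

Since the filter is allpass, it preserves signal energy while shifting it in time; delaying an impulse by τ = 5.5 samples moves its energy centroid to roughly sample 5.5.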

2.2. Objective function

As objective function, we use a function D(P0, P1) which is derived from the Kullback-Leibler Divergence (KLD)

D_KL(P||Q) = Σ_n P(n) · log(P(n)/Q(n)),

where P(n) and Q(n) are probability distributions of our (unmixed) microphone channels, and n runs over the discrete distributions.

In order to make the computation faster we avoid computing histograms. Instead of the histogram we use the normalized magnitude of the time domain signal itself,

P_i(n) = |s_i(n)| / ||s_i||_1    (7)

where n now is the time domain sample index. Notice that P_i(n) has properties similar to those of a probability, namely:

1. P_i(n) ≥ 0, ∀n.

2. Σ_{n=0}^∞ P_i(n) = 1, with i = 0, 1.

Instead of using the Kullback-Leibler Divergence directly, we turn our objective function into a symmetric (distance) function by using the sum D_KL(P||Q) + D_KL(Q||P), because this makes our separation more stable between the two channels. In order to apply minimization instead of maximization, we take its negative value. Hence our resulting objective function D(P0, P1) is

D(P0, P1) = -(D_KL(P0||P1) + D_KL(P1||P0))    (8)

2.3. Optimization

A widespread optimization method for BSS is Gradient Descent. This has the advantage that it finds the "steepest" way to an optimum, but it requires the computation of gradients, and it easily gets stuck in local minima or is slowed down by "narrow valleys" of the objective function. Hence, for the optimization of our coefficients we use a novel optimization of "random directions", similar to "Differential Evolution" [20, 21, 22]. Instead of differences of coefficient vectors for the update, we use a weight vector to model the expected variance distribution of our coefficients. This leads to a very simple yet very fast optimization algorithm, which can also be easily applied to real time processing, which is important for real time communications applications. The algorithm starts with a fixed starting point [1.0, 1.0, 1.0, 1.0], which we found to lead to robust convergence behaviour. Then it perturbs the current point with a vector of uniformly distributed random numbers between -0.5 and +0.5 (the random direction), element-wise multiplied with our weight vector (line 10 in Algorithm 1). If this perturbed point has a lower objective function value, we choose it as our next current point, and so on. The pseudo code of the optimization algorithm can be seen in Algorithm 1. Here, negabskl (yielding the values negabskl_0 and negabskl_1 in Algorithm 1) is our objective function that computes the KLD from the coefficient vector coeffs and the microphone signals in array X. We found that 20 iterations (and hence only 20 objective function evaluations) are already sufficient for our test files (each time over the entire file), which makes this a very fast algorithm.

The optimization may be performed, for example, at block 560 (see above).

Algorithm 1 is shown here below.

Algorithm 1 Optimization algorithm

1: procedure OPTIMIZE SEPARATION COEFFICIENTS(X)

2: INITIALIZATION

3: X ← convolutive mixture

4: initcoeffs = [1.0, 1.0, 1.0, 1.0] ← initial guess for separation coefficients

5: coeffweights ← weights for random search

6: coeffs = initcoeffs ← separation coefficients

7: negabskl_0 = negabskl(coeffs, X) ← calculation of KLD

8: OPTIMIZATION ROUTINE

9: loop:

10: coeffvariation = (random(4) * coeffweights) ← random variation of separation coefficients

11: negabskl_1 = negabskl(coeffs + coeffvariation, X) ← calculation of new KLD

12: if negabskl_1 < negabskl_0 then

13: negabskl_0 = negabskl_1

14: coeffs = coeffs + coeffvariation ← update separation coefficients
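Algorithm 1 can be rendered as runnable code. The sketch below implements the generic "random directions" scheme (perturb, keep if better); the quadratic toy objective stands in for the KLD objective negabskl and is an assumption for illustration only:

```python
import numpy as np

def optimize_random_directions(objective, init_coeffs, weights,
                               iterations=20, seed=0):
    """Random-directions optimization (Algorithm 1): perturb the current
    coefficient vector with uniform random numbers in [-0.5, 0.5], scaled
    element-wise by a weight vector, and keep the perturbed point whenever
    it lowers the objective. No gradients are needed."""
    rng = np.random.default_rng(seed)
    coeffs = np.array(init_coeffs, dtype=float)
    best = objective(coeffs)
    for _ in range(iterations):
        variation = (rng.random(len(coeffs)) - 0.5) * weights  # random direction
        candidate = coeffs + variation
        value = objective(candidate)
        if value < best:            # accept only improvements
            best = value
            coeffs = candidate
    return coeffs, best

# Toy stand-in for negabskl: a quadratic bowl with minimum at `target`
target = np.array([0.8, 0.9, 3.0, 2.0])
toy_objective = lambda c: float(np.sum((c - target) ** 2))
coeffs, best = optimize_random_directions(toy_objective,
                                          [1.0, 1.0, 1.0, 1.0],
                                          weights=np.array([0.5, 0.5, 2.0, 2.0]),
                                          iterations=200)
```

With the real objective, X (the microphone signals) would be passed through a closure, and only about 20 iterations are needed, as stated above.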

3. EXPERIMENTAL RESULTS

In this section, we evaluate the proposed time domain separation method, which we call AIRES (time domAIn fRactional dElay Separation), by using simulated room impulse responses and different speech signals from the TIMIT data-set [23]. Moreover, we made real-life recordings in real room environments. In order to evaluate the performance of our proposed approach, comparisons with State-of-the-Art BSS algorithms were done, namely, time-domain TRINICON [11], frequency-domain AuxIVA [5] and ILRMA [6, 7]. An implementation of the TRINICON BSS has been received from its authors. Implementations of AuxIVA and ILRMA BSS were taken from [24] and [25], respectively. The experiment has been performed using MATLAB R2017a on a laptop with a CPU Core i7 8-th Gen. and 16 GB of RAM.

3.1. Separation performance with synthetic RIRs

The room impulse response simulator based on the image model technique [26, 27] was used to generate room impulse responses. For the simulation setup the room size has been chosen to be 7 m x 5 m x 3 m. The microphones were positioned in the middle of the room at [3.475, 2.0, 1.5] m and [3.525, 2.0, 1.5] m, and the sampling frequency was 16 kHz. Ten pairs of speech signals were randomly chosen from the whole TIMIT data-set and convolved with the simulated RIRs. For each pair of signals, the simulation was repeated 16 times for random angle positions of the sound sources relative to the microphones, for 4 different distances and 3 reverberation times (RT60). The common parameters used in all simulations are given in Table 1 and a visualization of the setup can be seen in Fig. 1. The evaluation of the separation performance was done objectively by computing the Signal-to-Distortion Ratio (SDR) measure [28], as the original speech sources are available, and the computation time. The results are shown in Fig. 3.

The obtained results show a good performance of our approach for reverberation times smaller than 0.2 s. For RT60 = 0.05 s the average SDR measure over all distances is 15.64 dB and for RT60 = 0.1 s it is 10.24 dB. For a reverberation time of RT60 = 0.2 s our proposed BSS algorithm shares second place with ILRMA after TRINICON. The average computation time (on our computer) over all simulations can be seen in Table 2. As can be seen, AIRES outperforms all State-of-the-Art algorithms in terms of computation time.

By listening to the results we found that an SDR of about 8 dB results in good speech intelligibility, and our approach indeed features no unnatural sounding artifacts.

Table 1: Parameters used in simulation:

Room dimensions: 7 m x 5 m x 3 m
Microphone displacement: 0.05 m
Reverberation times: 0.05, 0.1, 0.2 s
Amount of mixes: 10 random mixes
Conditions for each mix: 16 random angles x 4 distances

Fig. 3 (performance evaluation of BSS algorithms applied to simulated data): (a) RT60 = 0.05 s, (b) RT60 = 0.1 s, (c) RT60 = 0.2 s.

Table 2: Comparison of average computation time (simulated datasets).

3.2. Real-life experiment

Finally, to evaluate the proposed sound source separation method, a real-life experiment was conducted. The real recordings have been captured in 3 different room types, namely a small apartment room (3 m x 3 m), an office room (7 m x 5 m) and a big conference room (15 m x 4 m). For each room type, 10 stereo mixtures of two speakers have been recorded. Due to the absence of "ground truth" signals, the separation performance was evaluated by computing the mutual information measure [29] between the separated channels. The results can be seen in Table 3. Please note that the average mutual information of the mixed microphone channels is 1.37, and the lower the mutual information between the separated signals, the better the separation.

Table 3: Comparison of separation performance using Mean Mutual Information (MI, real recordings).

From the comparison in Table 3 it can be seen that the performance tendency for the separation of the real recorded convolutive mixtures stayed the same as for the simulated data. Thus, one can conclude that AIRES, despite its simplicity, can compete with prior-art blind source separation algorithms.

4. CONCLUSIONS

In this paper, we presented a fast time-domain blind source separation technique based on the estimation of FIR fractional delay filters to minimize crosstalk between two audio channels. We have shown that estimation of the fractional delays and attenuation factors results in a fast and effective separation of the source signals from stereo convolutive mixtures. For this, we introduced an objective function which was derived from the negative Kullback-Leibler Divergence. To make the minimization robust and fast, we presented a novel "random directions" optimization method, which is similar to the optimization of "differential evolution". To evaluate the proposed BSS technique, a set of experiments was conducted. We evaluated and compared our system with other state-of-the-art methods on simulated data and also real room recordings. Results show that our system, despite its simplicity, is competitive in its separation performance, but has much lower computational complexity and no system delay. This also enables an online adaptation for real-time minimum-delay applications and for moving sources (like a moving speaker). These properties make AIRES well suited for real-time applications on small devices, like hearing aids or small teleconferencing setups. A test program of AIRES BSS is available on our GitHub [30].

Further aspects (see also examples above and/or below)

Further aspects are here discussed, e.g. regarding a multichannel or stereo audio source separation method and an update method for it. It minimizes an objective function (like mutual information), and reduces crosstalk by taking the signal from the other channel(s), applying an attenuation factor and a (possibly fractional) delay to it, and subtracting it from the current channel, for example. It uses the method of "random directions" to update the delay and attenuation coefficients, for example.
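The "random directions" update described here can be sketched as follows. This is an illustrative sketch under assumed names, iteration count and step size (none of which are specified at this point in the text): perturb the coefficient vector randomly and keep the perturbation only if the objective improves.

```python
import numpy as np

def random_directions(objective, coeffs, iterations=2000, sigma=0.2):
    # Iteratively try a random update vector; if it improves the
    # objective (i.e. the separation), keep it, otherwise discard it.
    best = np.asarray(coeffs, dtype=float)
    best_val = objective(best)
    for _ in range(iterations):
        candidate = best + sigma * np.random.randn(len(best))
        val = objective(candidate)
        if val < best_val:  # improvement -> keep for the next iteration
            best, best_val = candidate, val
    return best, best_val
```

For the separation task, the coefficient vector would hold the delay and attenuation values and the objective would be the separation measure; here any callable returning a scalar can be plugged in.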

See also the following aspects:

1: A multichannel or stereo source separation method, whose coefficients for the separation are iteratively updated by adding a random update vector, and, if this update results in an improved separation, keeping this update for the next iteration; otherwise discarding it.

2: A method according to aspect 1 with suitable fixed variances for the random update for each coefficient.

3: A method according to aspect 1 or 2, where the update occurs on a live audio stream, and where the update is based on past audio samples.

4: A method according to aspect 1, 2 or 3, where the variance of the update is slowly reduced over time.

Other aspects may be:

1) A method of online optimization of parameters, which minimizes an objective function with a multi-dimensional input vector x, by first determining a test vector from a given vicinity space, then updating the previous best vector with the test vector; if the value of the objective function so obtained is improved, this updated vector becomes the new best, otherwise it is discarded.

2) Method of aspect 1, applied to the task of separating N sources in an audio stream of N channels, where the coefficient vector consists of the delay and attenuation values needed to cancel unwanted sources.

3) Method of aspect 2, applied to the task of teleconferencing with more than 1 speaker at one or more sites.

4) Method of aspect 2, applied to the task of separating audio sources before encoding them with audio encoders.

5) Method of aspect 2, applied to the task of separating a speech source from music or noise source(s).

Additional aspects

Additional aspects are here discussed.

Introduction Stereo Separation

A goal is to separate sources with multiple microphones (here: 2). Different microphones pick up sound with different amplitudes and delays. The discussion below includes programming examples in Python. This is for easier understandability, to test whether and how the algorithms work, and for reproducibility of results, to make the algorithms testable and useful for other researchers.

Other aspects

Here below, other aspects are discussed. They are not necessarily bound to each other, but may be combined to create new embodiments. Each bullet point may be independent from the other ones and may, alone or in combination with other features (e.g. other bullet points) or other features discussed above or below, complement or further specify at least some of the examples above and/or below and/or some of the features disclosed in the claims.

Spatial Perception

The ear mainly uses 2 effects to estimate the spatial direction of sound:

- Interaural Level Differences (ILD)

- Interaural Time Differences (ITD)

- Music from studio recordings mostly uses only level differences (no time differences), called "panning".

- Signals from stereo microphones, for instance from stereo webcams, show mainly time differences, and not much level differences.

Stereo Recording with Stereo Microphones

- The setup with stereo microphones with time differences.

- Observe: the effect of the sound delays from the finite speed of sound, and the attenuations, can be described by a mixing matrix with delays.

See Fig. 2a: A stereo teleconferencing setup. Observe the signal delays between the microphones.

Stereo Separation. ILD

- The case with no time differences (panning only) is the easier one. Normally, Independent Component Analysis (ICA) is used.

- It computes an "unmixing" matrix which produces statistically independent signals. Remember: the joint entropy of two signals X and Y is H(X, Y).

- The conditional entropy is H(X|Y).

- Statistical independence also means: H(X|Y) = H(X), H(X, Y) = H(X) + H(Y).

- Mutual information: I(X, Y) = H(X, Y) - H(X|Y) - H(Y|X) >= 0; for independence it becomes 0.

- In general the optimization minimizes this mutual information.

ITD Previous Approach

- The ITD case with time differences is more difficult to solve

- The "Minimum Variance Principle" tries to suppress noise from a microphone array. It estimates a correlation matrix of the microphone channels, and applies a Principal Component Analysis (PCA) or Karhunen-Loève Transform (KLT).

- This de-correlates the resulting channels.

- The channels with the lowest eigenvalues are assumed to be noise and are set to zero.

ITD Previous Approach Example

For our stereo case, we only get 2 channels, and compute and play the signals with the larger and smaller eigenvalue: python KLT_separation.py. Observe: the channel with the larger eigenvalue seems to have the lower frequencies, the other the (weaker) higher frequencies, which are often just noise.
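The KLT step described above could be sketched as follows. This is an assumed implementation for clarity (the actual KLT_separation.py may differ): estimate the 2x2 channel correlation matrix and rotate the stereo signal onto its eigenvectors.

```python
import numpy as np

def klt_channels(x):
    # x: samples x 2 stereo signal
    R = (x.T @ x) / x.shape[0]            # channel correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)  # eigenvalues in ascending order
    y = x @ eigvecs                       # de-correlated channels
    # return the larger-eigenvalue channel first
    return y[:, ::-1], eigvals[::-1]
```

The returned channels are de-correlated; in the previous approach the channel with the smaller eigenvalue would be assumed to be noise and set to zero.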

ITD, New Approach

- Idea: the unmixing matrix needs signal delays for unmixing, i.e. for inverting the mixing matrix.

- Important simplification: we assume only a single signal delay path between the microphones for each of the 2 sources.

- In reality there are multiple delays from room reflections, but these are usually much weaker than the direct path.

Stereo Separation ITD, New Approach

Assume a mixing matrix with attenuations a_i and delays of d_i samples. In the z-transform domain, a delay by d samples is represented by a factor of z^(-d) (observe that the delay can be a fractional number of samples). Hence the unmixing matrix is its inverse; it turns out we can simplify the unmixing matrix by dropping the fraction with the determinant in front of the matrix without sacrificing performance in practice.
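Written out, a mixing matrix of this form and its inverse could look as follows (a sketch based on the stated single-delay-path assumption, with attenuations a_0, a_1 and delays d_0, d_1; the exact notation of the figures may differ):

```latex
\mathbf{H}(z) =
\begin{pmatrix}
1 & a_1 z^{-d_1} \\
a_0 z^{-d_0} & 1
\end{pmatrix},
\qquad
\mathbf{H}^{-1}(z) =
\frac{1}{1 - a_0 a_1\, z^{-d_0 - d_1}}
\begin{pmatrix}
1 & -a_1 z^{-d_1} \\
-a_0 z^{-d_0} & 1
\end{pmatrix}
```

Dropping the determinant fraction leaves only the matrix on the right: each output is the corresponding input minus a delayed and attenuated version of the other input, which is the crosstalk cancellation described above.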

Coefficient Computation

The coefficients a_i and d_i need to be obtained by optimization. We can again use minimization of the mutual information of the resulting signals to find the coefficients. For the mutual information we need the joint entropy. In Python we can compute it from the 2-dimensional probability density function using numpy.histogram2d. We call it with:

hist2d, xedges, yedges = np.histogram2d(x[:,0], x[:,1], bins=100)
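From the joint histogram, the entropies and the mutual information can be computed, for instance as follows (a sketch; the function name and the marginal computation are assumptions, only the np.histogram2d call with bins=100 is from the text):

```python
import numpy as np

def mutual_information(x0, x1, bins=100):
    # Estimate I(X, Y) = H(X) + H(Y) - H(X, Y) from a 2-D histogram.
    hist2d, xedges, yedges = np.histogram2d(x0, x1, bins=bins)
    p_xy = hist2d / np.sum(hist2d)   # joint probability estimate
    p_x = np.sum(p_xy, axis=1)       # marginal of X
    p_y = np.sum(p_xy, axis=0)       # marginal of Y
    nz = p_xy > 0                    # avoid log(0)
    h_xy = -np.sum(p_xy[nz] * np.log(p_xy[nz]))
    h_x = -np.sum(p_x[p_x > 0] * np.log(p_x[p_x > 0]))
    h_y = -np.sum(p_y[p_y > 0] * np.log(p_y[p_y > 0]))
    return h_x + h_y - h_xy          # equals H(X,Y) - H(X|Y) - H(Y|X)
```

For independent signals the estimate is close to 0 (up to histogram bias); for strongly dependent signals it is large, so minimizing it over the unmixing coefficients drives the outputs towards independence.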

Optimization of the Objective Function

Note that our objective function might have several minima! This means we cannot use convex optimization unless its starting point is sufficiently close to the global minimum.

Hence we need to use non-convex optimization. Python has a very powerful non-convex optimization method: scipy.optimize.differential_evolution. We call it with:

coeffs_minimized = opt.differential_evolution(mutualinfocoeffs, bounds, args=(X,), tol=1e-4, disp=True)
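As a minimal self-contained illustration of this call, a toy quadratic stands in for the mutual-information objective below (the toy objective, its bounds and the variable names are assumptions for demonstration only):

```python
import numpy as np
import scipy.optimize as opt

# Toy stand-in for mutualinfocoeffs(coeffs, X): a function of the
# coefficient vector with a single global minimum at coeffs == X.
def objective(coeffs, X):
    return np.sum((coeffs - X) ** 2)

X = np.array([0.5, -1.5])
bounds = [(-2, 2), (-2, 2)]           # search range per coefficient
result = opt.differential_evolution(objective, bounds, args=(X,),
                                    tol=1e-4, disp=False)
```

result.x then holds the coefficient vector at the found minimum; in the separation task these would be the delay and attenuation values.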

Alternative Objective Functions

But that is complex to compute. Hence we look for an alternative objective function which has the same minimum. We consider the Kullback-Leibler and Itakura-Saito divergences. The Kullback-Leibler Divergence of 2 probability distributions P and Q is defined as

D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)),

where i runs over the (discrete) distributions. To avoid computing histograms, as a trick we simply treat the normalized magnitudes of our time domain samples as a probability distribution. Since these are dissimilarity measures, they need to be maximized, hence their negative values need to be minimized.

Kullback-Leibler Python Function

In Python this objective function is:

def minabsklcoeffs(coeffs, X):
    # computes the normalized magnitude of the channels and then applies
    # the Kullback-Leibler divergence
    X_prime = unmixing(coeffs, X)
    X_abs = np.abs(X_prime)
    # normalize to sum() = 1, to make it look like a probability:
    X_abs[:, 0] = X_abs[:, 0] / np.sum(X_abs[:, 0])
    X_abs[:, 1] = X_abs[:, 1] / np.sum(X_abs[:, 1])
    abskl = np.sum(X_abs[:, 0] * np.log((X_abs[:, 0] + 1e-6) / (X_abs[:, 1] + 1e-6)))
    return -abskl

(here minabsklcoeffs corresponds to minabskl in Algorithm 1)

Comparison of Objective Functions

We fix the coefficients except one of the delay coefficients, to compare the objective functions in a plot.

Fig. 6a shows the objective functions for an example signal and example coefficients. Observe that the functions have indeed the same minima! "abskl" is Kullback-Leibler on the absolute values of the signal, and is the smoothest.

Optimization Examples

- Optimization using mutual information: python ICAmutualinfo_puredelay.py (this needs ca. 121 slower iterations)

- Optimization using Kullback-Leibler: python ICAabskl_puredelay.py (this needs ca. 39 faster iterations)

Listening to the resulting signals confirms that they are really separated enough for intelligibility, where they were not before.

Further Speeding up Optimization

To further speed up optimization we need to enable convex optimization. For that we need an objective function with only one minimum (in the best case). Many of the smaller local minima come from quickly changing high frequency components of our signal. Approach: we compute the objective function based on a low-passed and down-sampled version of our signals (which also further speeds up the computation of the objective function). The lowpass needs to be narrow enough to remove the local minima, but broad enough to still obtain sufficiently precise coefficients.

Further Speeding up Optimization, Low Pass

We choose a downsampling factor of 8, and accordingly a low pass of 1/8th of the full band. We use the following low pass filter, with a bandwidth of about 1/8th of the full band (1/8th of the Nyquist frequency).
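A low pass and downsampling step of this kind could be sketched as follows. The filter length and design method here are assumptions for illustration (the text does not specify the filter used), only the factor of 8 and the cutoff of 1/8th of the Nyquist frequency are from the text:

```python
import numpy as np
import scipy.signal as sig

def lowpass_downsample(x, factor=8, numtaps=64):
    # FIR low pass; firwin's cutoff is relative to the Nyquist frequency,
    # so 1/factor gives a bandwidth of 1/8th of the full band.
    h = sig.firwin(numtaps, 1.0 / factor)
    y = sig.lfilter(h, 1.0, x)
    return y[::factor]   # keep every 8th sample
```

The objective function is then evaluated on lowpass_downsample(x) instead of x, which both smooths away small local minima and reduces the number of samples by a factor of 8.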

Fig. 6b: Used low pass filter, magnitude frequency response. The x-axis is the normalized frequency, with 2π being the sampling frequency.

Objective Functions from Low Pass

We can now again plot and compare our objective functions.

Fig. 6c: Objective functions for the low pass filtered example signal and example coefficients. Observe that the functions now indeed have mainly 1 minimum!

“abskl” is again the smoothest (no small local minima).

Further Speeding up Optimization, Example

We can now try convex optimization, for instance the method of Conjugate Gradients: python ICAabskl_puredelay_lowpass.py

- Observe: The optimization finished successfully almost instantaneously!

- The resulting separation has the same quality as before.
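In Python, Conjugate Gradients is available as scipy.optimize.minimize with method='CG'. A minimal sketch on a toy smooth objective (the objective and its minimum location are assumptions, standing in for the low-passed separation objective):

```python
import numpy as np
import scipy.optimize as opt

# Toy smooth, single-minimum objective standing in for the low-passed
# "abskl" objective; its minimum is at coeffs = [0.3, -0.7].
def smooth_objective(coeffs):
    return (coeffs[0] - 0.3) ** 2 + 2.0 * (coeffs[1] + 0.7) ** 2

result = opt.minimize(smooth_objective, x0=np.zeros(2), method='CG')
```

On a near-convex objective like this, CG converges in a handful of iterations, which is the speed-up the low pass filtering is meant to enable.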

Other explanations

Above, different inventive examples and aspects are described. Also, further examples are defined by the enclosed claims (examples are also in the claims). It should be noted that any example as defined by the claims can be supplemented by any of the details (features and functionalities) described in the preceding pages. Also, the examples described above can be used individually, and can also be supplemented by any of the features in another chapter, or by any feature included in the claims. The text in round brackets and square brackets is optional, and defines further embodiments (further to those defined by the claims). Also, it should be noted that individual aspects described herein can be used individually or in combination. Thus, details can be added to each of said individual aspects without adding details to another one of said aspects.

It should also be noted that the present disclosure describes, explicitly or implicitly, features of a mobile communication device and of a receiver of a mobile communication system.

Depending on certain implementation requirements, examples may be implemented in hardware. The implementation may be performed using a digital storage medium, for example a floppy disk, a Digital Versatile Disc (DVD), a Blu-Ray Disc, a Compact Disc (CD), a Read-only Memory (ROM), a Programmable Read-only Memory (PROM), an Erasable and Programmable Read-only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM) or a flash memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.

Generally, examples may be implemented as a computer program product with program instructions, the program instructions being operative for performing one of the methods when the computer program product runs on a computer. The program instructions may for example be stored on a machine readable medium.

Other examples comprise the computer program for performing one of the methods described herein, stored on a machine-readable carrier. In other words, an example of the method is, therefore, a computer program having program instructions for performing one of the methods described herein, when the computer program runs on a computer.

A further example of the methods is, therefore, a data carrier medium (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier medium, the digital storage medium or the recorded medium are tangible and/or non-transitory, rather than signals which are intangible and transitory. A further example comprises a processing unit, for example a computer, or a programmable logic device, performing one of the methods described herein.

A further example comprises a computer having installed thereon the computer program for performing one of the methods described herein.

A further example comprises an apparatus or a system transferring (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.

In some examples, a programmable logic device (for example, a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some examples, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods may be performed by any appropriate hardware apparatus.

The above described examples are illustrative for the principles discussed above. It is understood that modifications and variations of the arrangements and the details described herein will be apparent. It is the intent, therefore, to be limited by the scope of the appended patent claims and not by the specific details presented by way of description and explanation of the examples herein.

References

[1] A. Hyvarinen, J. Karhunen, and E. Oja, “Independent component analysis.” John Wiley & Sons, 2001.

[2] G. Evangelista, S. Marchand, M. D. Plumbley, and E. Vincent, “Sound source separation,” in DAFX : Digital Audio Effects, second edition ed. John Wiley and Sons, 2011.

[3] J. Tariqullah, W. Wang, and D. Wang, “A multistage approach to blind separation of convolutive speech mixtures,” in Speech Communication, 2011, vol. 53, pp. 524-539.

[4] J. Benesty, J. Chen, and E. A. Habets, “Speech enhancement in the STFT domain,” in Springer, 2012.

[5] J. Jansky, Z. Koldovsky, and N. Ono, “A computationally cheaper method for blind speech separation based on auxiva and incomplete demixing transform,” in IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), Xi’an, China, 2016.

[6] D. Kitamura, N. Ono, H. Sawada, H Kameoka, and H. Saruwatari, “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” in IEEE/ACM Trans. ASLP, vol. 24, no. 9, 2016, pp. 1626-1641.

[7] “Determined blind source separation with independent low-rank matrix analysis,” in Springer, 2018, p. 31.

[8] H.-C. Wu and J. C. Principe, “Simultaneous diagonalization in the frequency domain (sdif) for source separation,” in Proc. ICA, 1999, pp. 245-250.

[9] H. Sawada, N. Ono, H. Kameoka, and D. Kitamura, “Blind audio source separation on tensor representation,” in ICASSP, Apr. 2018.

[10] J. Harris, S. M. Naqvi, J. A. Chambers, and C. Jutten, “Real-time independent vector analysis with Student’s t source prior for convolutive speech mixtures,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, Apr. 2015.

[11] H. Buchner, R. Aichner, and W. Kellermann, “Trinicon: A versatile framework for multichannel blind signal processing,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Que., Canada, 2004.

[12] J. Chua, G. Wang, and W. B. Kleijn, “Convolutive blind source separation with low latency,” in Acoustic Signal Enhancement (IWAENC), IEEE International Workshop, 2016, pp. 1-5.

[13] W. Kleijn and K. Chua, “Non-iterative impulse response shortening method for system latency reduction,” in Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 581-585.

[14] I. Selesnick, “Low-pass filters realizable as all-pass sums: design via a new flat delay filter,” in IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 46, 1999.

[15] T. I. Laakso, V. Valimaki, M. Karjalainen, and U. K. Laine, “Splitting the unit delay,” in IEEE Signal Processing Magazine, Jan. 1996.

[16] M. Brandstein and D. Ward, “Microphone arrays, signal processing techniques and applications,” in Springer, 2001.

[17] “Beamforming,” http://www.labbookpages.co.uk/audio/ beamforming/delaySum.html, accessed: 2019-04-21.

[18] Benesty, Sondhi, and Huang, “Handbook of speech processing,” in Springer, 2008.

[19] J. P. Thiran, “Recursive digital filters with maximally flat group delay,” in IEEE Trans. on Circuit Theory, vol. 18, no. 6, Nov. 1971, pp. 659-664.

[20] S. Das and P. N. Suganthan, “Differential evolution: A survey of the state-of-the-art,” in IEEE Trans. on Evolutionary Computation, Feb. 2011, vol. 15, no. 1, pp. 4-31.

[21] R. Storn and K. Price, “Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces,” in Journal of Global Optimization. 11 (4), 1997, pp. 341-359.

[22] “Differential evolution,” http://www1.icsi.berkeley.edu/~storn/code.html, accessed: 2019-04-21.

[23] J. Garofolo et al., “Timit acoustic-phonetic continuous speech corpus,” 1993.

[24] “Microphone array speech processing,” https://github.com/ ZitengWang/MASP, accessed: 2019-07-29.

[25] “ILRMA,” https://github.com/d-kitamura/ILRMA, accessed: 2019-07-29.

[26] R. B. Stephens and A. E. Bate, Acoustics and Vibrational Physics, London, U.K., 1966.

[27] J. B. Allen and D. A. Berkley, Image method for efficiently simulating small room acoustics. J. Acoust. Soc. Amer., 1979, vol. 65.

[28] C. Fevotte, R. Gribonval, and E. Vincent, “Bss eval toolbox user guide,” in Tech. Rep. 1706, IRISA Technical Report 1706, Rennes, France, 2005.

[29] E. G. Learned-Miller, Entropy and Mutual Information. Department of Computer Science, University of Massachusetts Amherst, Amherst, MA 01003, 2013.

[30] “Comparison of blind source separation techniques,” https://github.com/TUIImenauAMS/Comparison-of-Blind-Source-Separation-techniques, accessed: 2019-07-29.