

Title:
DETERMINING A ROOM RESPONSE OF A DESIRED SOURCE IN A REVERBERANT ENVIRONMENT
Document Type and Number:
WIPO Patent Application WO/2020/064089
Kind Code:
A1
Abstract:
The present disclosure relates to an apparatus and method for determining a room response of a desired source in a reverberant environment from an observed signal received by a plurality of microphones, wherein a direct-path steering vector for the plurality of microphones is determined based on a known direction and/or location of the desired source, a reverberant-path steering vector for the plurality of microphones is determined that maximizes a cross moment of a direct-path response to the observed signal and a reverberant-path response to the observed signal, and the room response is determined based on a combination of the direct-path steering vector and the reverberant-path steering vector. The signal power contained in the reverberant components of the observed signal is hence used to improve the signal-to-interference-plus-noise ratio of the target signal without additional knowledge as compared to a conventional direct-path beamformer.

Inventors:
JIN WENYU (DE)
SHERSON THOMAS (NL)
KLEIJN WILLEM BASTIAAN (NL)
SETIAWAN PANJI (DE)
Application Number:
PCT/EP2018/075994
Publication Date:
April 02, 2020
Filing Date:
September 25, 2018
Assignee:
HUAWEI TECH CO LTD (CN)
JIN WENYU (DE)
International Classes:
G01H7/00; G10L21/0208; G10L21/0272; G10L21/0216; G10L21/038; H04R3/00
Foreign References:
EP2642768A1 (2013-09-25)
Other References:
STEFANAKIS NIKOLAOS ET AL: "Acoustic Beamforming in Front of a Reflective Plane", 2018 26TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), EURASIP, 3 September 2018 (2018-09-03), pages 26 - 30, XP033461538, DOI: 10.23919/EUSIPCO.2018.8553103
E. VINCENT; R. GRIBONVAL; C. FEVOTTE: "Performance Measurement in Blind Audio Source Separation", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, vol. 14, no. 4, 2006, pages 1462 - 1469
S. MARKOVICH; S. GANNOT; I. COHEN: "Multichannel Eigenspace Beamforming in a Reverberant Noisy Environment With Multiple Interfering Speech Signals", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, vol. 17, no. 6, 2009, pages 1071 - 1083, XP011263241, DOI: 10.1109/TASL.2009.2016395
H. KUTTRUFF: "Room acoustics", 2009, CRC PRESS
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:
CLAIMS

1. An apparatus for determining a room response of a desired source (110) in a reverberant environment from an observed signal received by a plurality of microphones (100), comprising a processing circuitry (150) configured to: determine a direct-path steering vector for the plurality of microphones (100) based on a known direction and/or location of the desired source (110); determine a reverberant-path steering vector for the plurality of microphones (100) that maximizes a cross moment of a direct-path response to the observed signal and a reverberant-path response to the observed signal; and determine the room response based on a combination of the direct-path steering vector and the reverberant-path steering vector.

2. The apparatus of claim 1, wherein the processing circuitry (150) is further configured to apply a relative scaling to the direct-path steering vector and the reverberant-path steering vector before combining the direct-path steering vector and the reverberant-path steering vector.

3. The apparatus of claim 2, wherein a scaling factor for the relative scaling is based on a ratio of standard deviations of the reverberant-path response to the observed signal and the direct-path response to the observed signal.

4. The apparatus of any one of claims 1 to 3, wherein the processing circuitry (150) is further configured to maximize a second order central cross moment of the direct-path response to the observed signal and the reverberant-path response to the observed signal.

5. The apparatus of any one of the preceding claims, wherein the processing circuitry (150) is further configured to determine the reverberant-path steering vector such that the reverberant-path steering vector is orthogonal to the direct-path steering vector.

6. The apparatus of any one of the preceding claims, wherein the processing circuitry (150) is further configured to determine the direction and/or the location of the desired source (110) with respect to the plurality of microphones (100) based on source detection performed on images of the room recorded by a camera (120).

7. The apparatus of any one of the preceding claims, wherein the processing circuitry (150) is further configured to determine the direction and/or the location of the desired source (110) with respect to the plurality of microphones (100) based on direct beamforming using the plurality of microphones (100).

8. The apparatus of any one of the preceding claims, wherein the processing circuitry (150) is further configured to: determine the direct-path response to the observed signal using the direct-path steering vector; determine the reverberant-path response to the observed signal using the reverberant-path steering vector; and separate a source signal of the desired source (110) based on a combination of the direct-path response and the reverberant-path response.

9. The apparatus of claim 8, wherein the processing circuitry (150) is further configured to maximize a signal-to-interference-plus-noise ratio (SINR) of the source signal over a plurality of time intervals of the observed signal.

10. The apparatus of any one of the preceding claims, wherein the processing circuitry (150) is further configured to transform the observed signal from a time domain into a frequency domain and to determine the room response for each frequency bin of the transformed observed signal.

11. The apparatus of any one of the preceding claims, further comprising: the plurality of microphones (100) configured to capture a multi-channel sound signal; wherein the processing circuitry (150) is further configured to perform temporal sampling on the multi-channel sound signal to generate the observed signal.

12. The apparatus of any one of the preceding claims, further comprising: a camera (120) configured and arranged to record images of a room corresponding to the room response.

13. A method for determining a room response of a desired source (110) in a reverberant environment from an observed signal received by a plurality of microphones (100), the method comprising: determining a direct-path steering vector for the plurality of microphones (100) based on a known direction and/or location of the desired source (110); determining a reverberant-path steering vector for the plurality of microphones (100) that maximizes a cross moment of a direct-path response to the observed signal and a reverberant-path response to the observed signal; and determining the room response based on a combination of the direct-path steering vector and the reverberant-path steering vector.

14. The method of claim 13, further comprising: separating a source signal of the desired source (110) using the determined room response.

15. A computer-readable medium storing instructions that when executed on a processor cause the processor to perform the method of claim 13 or 14.

Description:
DETERMINING A ROOM RESPONSE OF A DESIRED SOURCE IN A

REVERBERANT ENVIRONMENT

The present disclosure relates to a method and an apparatus for determining the room response of a particular source of an audio signal in a reverberant environment from an observed signal received by a plurality of microphones.

BACKGROUND

In the field of acoustic processing, a major challenge is how to separate a particular source signal from interference, for example to perform speech recognition or to rearrange an acoustic scene. A natural approach to address this challenge is to combine the signals recorded by a plurality of microphones to favor particular source locations over others. Generally, such a combination is linear and time-invariant. The advantage of such multi-channel processing over single-channel processing is that it can enhance the signal-to-interference ratio without imposing distortion on the source signal.

To find the best linear combination, various paradigms have been used in the art. These include conventional beamforming that assumes knowledge of the desired source location and blind source separation that is based on the assumption of source independence. Blind source separation aims at separating all of the involved sources, regardless of their attribution to the desired or interfering sources. An overview of known blind source separation algorithms is presented in the article "Performance Measurement in Blind Audio Source Separation" by E. Vincent, R. Gribonval and C. Fevotte in IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pages 1462-1469, 2006.

Blind source separation basically determines the relative room response of the individual sources, i.e. the room response with an arbitrary filtering. The potential gain in the signal-to-interference ratio for blind source separation is higher than that of conventional beamforming, as it can exploit the full spatial signature of the desired source within a room. In contrast, conventional beamforming uses only the direct-path signal component to improve the signal-to-interference ratio. The beamforming family of algorithms concentrates on enhancing the sum of the desired sources while treating all other signals as interfering sources. Typical examples of beamforming algorithms are the delay-and-sum beamformer and the filter-and-sum beamformer. An approach that lies between conventional beamforming and blind source separation is so-called generalized beamforming, i.e. beamforming based on the relative room response rather than a physical direction. The notion of beamforming implies that finding the response is not blind, and the solution, hence, is not based on the assumption of independence.

In the past, generalized beamforming has been based on, for example, the knowledge of certain time segments where only the target source is active. Such time segments can then be used to determine the room response. An example of such an approach is described in the article "Multichannel Eigenspace Beamforming in a Reverberant Noisy Environment With Multiple Interfering Speech Signals" by S. Markovich, S. Gannot, and I. Cohen in IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pages 1071-1083, 2009.

In practical situations, it has proven difficult, however, to estimate the generalized beamformer. Furthermore, classic beamforming approaches generally rely on strong assumptions, for instance with respect to the location and the distance of the target source. In addition, it is desirable to further increase the signal-to-interference ratio, in particular for applications including teleconferencing, augmented reality, in-vehicle communication systems, and smartphones/tablets with two or more microphones.

SUMMARY OF INVENTION

The present disclosure provides a method and an apparatus for generalized beamforming with a gain in the signal-to-interference ratio that is higher than that of a conventional direct-path beamformer, with no additional information over that provided to a conventional direct-path beamformer. The method is particularly effective in removing diffuse noise and the sensor self-noise from the estimate of the target signal. The method described below can be interpreted as a form of informed blind source separation.

According to one aspect of the present disclosure, an apparatus for determining a room response of a desired source in a reverberant environment from an observed signal received by a plurality of microphones is provided wherein the apparatus comprises a processing circuitry configured to: determine a direct-path steering vector for the plurality of microphones based on a known direction and/or location of the desired source; determine a reverberant-path steering vector for the plurality of microphones that maximizes a cross moment of a direct-path response to the observed signal and a reverberant-path response to the observed signal; and determine the room response based on a combination of the direct-path steering vector and the reverberant-path steering vector. The processing circuitry may include a dedicated direct-path beamforming unit configured to determine the direct-path steering vector for the plurality of microphones based on the known direction and/or location of the desired source. The processing circuitry may further include a dedicated reverberant-path beamforming unit configured to determine the reverberant-path steering vector for the plurality of microphones that maximizes the cross moment of the direct-path response to the observed signal and the reverberant-path response to the observed signal. Furthermore, the processing circuitry may include a dedicated combiner configured to determine the room response based on the combination of the direct-path steering vector and the reverberant-path steering vector.

According to a further aspect, the processing circuitry may be further configured to apply a relative scaling to the direct-path steering vector and the reverberant-path steering vector before combining the direct-path steering vector and the reverberant-path steering vector. The processing circuitry may include a dedicated scaling unit configured to apply the relative scaling to the direct-path steering vector and the reverberant-path steering vector before passing the direct-path steering vector and the reverberant-path steering vector to the combiner.

The scaling factor for the relative scaling may be based on a ratio of standard deviations of the reverberant-path response to the observed signal and the direct-path response to the observed signal.
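As a minimal sketch of this aspect, the relative scaling factor can be computed from the standard deviations of the two beamformer outputs. The helper name and the toy response signals below are assumptions for illustration only:

```python
import numpy as np

def relative_scale(x_direct, x_reverb):
    """Ratio of standard deviations of the reverberant-path response
    and the direct-path response, used as the relative scaling factor."""
    return np.std(x_reverb) / np.std(x_direct)

rng = np.random.default_rng(3)
x_d = rng.standard_normal(1000)          # toy direct-path response
x_r = 0.5 * rng.standard_normal(1000)    # toy reverberant-path response

# Apply as, e.g., w = w_direct + beta * w_reverb before combining.
beta = relative_scale(x_d, x_r)
```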

According to a further aspect, the processing circuitry may be further configured to maximize a second order central cross moment of the direct-path response to the observed signal and the reverberant-path response to the observed signal. It may in particular be the reverberant-path beamforming unit that is configured to maximize the second order central cross moment of the direct-path response to the observed signal and the reverberant-path response to the observed signal.

According to a further aspect, the processing circuitry may be further configured to determine the reverberant-path steering vector such that the reverberant-path steering vector is orthogonal to the direct-path steering vector. It may in particular be the reverberant-path beamforming unit that is further configured to determine the reverberant-path steering vector such that the reverberant-path steering vector is orthogonal to the direct-path steering vector.
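One closed-form reading of these two aspects taken together can be sketched as follows. This is an illustrative assumption, not necessarily the procedure of the embodiments: taking R = E[y y^H] as the covariance of the observed signal, the cross moment of the two beamformer outputs is w_d^H R w_r, and its magnitude is maximized over unit-norm w_r orthogonal to w_d by projecting R^H w_d onto the orthogonal complement of w_d:

```python
import numpy as np

def reverberant_steering(w_d, R):
    """Maximize |w_d^H R w_r| over unit-norm w_r with w_r orthogonal
    to w_d: project g = R^H w_d onto the orthogonal complement of w_d
    and renormalize."""
    w_d = w_d / np.linalg.norm(w_d)
    g = R.conj().T @ w_d
    g_perp = g - w_d * (w_d.conj() @ g)   # remove component along w_d
    return g_perp / np.linalg.norm(g_perp)

# Toy example: sample covariance of a random 3-channel observation.
rng = np.random.default_rng(2)
Y = rng.standard_normal((3, 200)) + 1j * rng.standard_normal((3, 200))
R = Y @ Y.conj().T / 200
w_d = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)
w_r = reverberant_steering(w_d, R)

print(abs(w_d.conj() @ w_r))              # ~0: orthogonal by construction
```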

According to a further aspect, the processing circuitry may be further configured to determine the direction and/or the location of the desired source with respect to the plurality of microphones based on source detection performed on images of the room recorded by a camera. The room may in particular correspond to the reverberant environment. The apparatus may in particular comprise an image processing unit configured to determine the direction and/or the location of the desired source with respect to the plurality of microphones based on source detection performed on images of the room recorded by a camera. The image processing unit may be provided as part of the processing circuitry or separately.

According to a further aspect, the processing circuitry may be further configured to determine the direction and/or the location of the desired source with respect to the plurality of microphones based on direct beamforming using the plurality of microphones. In particular, a dedicated delay-sum beamforming unit may be provided that is configured to determine the direction and/or the location of the desired source with respect to the plurality of microphones based on direct beamforming using the plurality of microphones.

According to a further aspect, the processing circuitry may be further configured to: determine the direct-path response to the observed signal using the direct-path steering vector; determine the reverberant-path response to the observed signal using the reverberant-path steering vector; and separate a source signal of the desired source based on a combination of the direct-path response and the reverberant-path response. A dedicated direct-path signal estimator may be provided that is configured to determine the direct-path response to the observed signal using the direct-path steering vector. Furthermore, a dedicated reverberant-path signal estimator may be provided that is configured to determine the reverberant-path response to the observed signal using the reverberant-path steering vector. In addition, a signal combiner may be provided that is configured to separate the source signal of the desired source based on the combination of the direct-path response and the reverberant-path response.

The processing circuitry may in particular be further configured to maximize an SINR of the source signal over a plurality of time intervals of the observed signal. A dedicated signal-to-interference-plus-noise ratio (SINR) unit may be provided that is configured to maximize the SINR of the source signal over the plurality of time intervals of the observed signal.

According to a further aspect, the processing circuitry may be further configured to transform the observed signal from a time domain into a frequency domain and to determine the room response for each frequency bin of the transformed observed signal. A Discrete Fourier transform (DFT) unit may be provided that is configured to transform the observed signal from the time domain into the frequency domain.
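A minimal sketch of such per-bin processing follows; the frame length, hop size, and window are illustrative choices, not values from the disclosure:

```python
import numpy as np

def stft_bins(y, n_fft=256, hop=128):
    """Frame a multichannel time signal of shape (M, T) and return a
    per-bin spectrum of shape (frames, n_fft//2 + 1, M), so that each
    frequency bin can be processed independently."""
    M, T = y.shape
    win = np.hanning(n_fft)
    starts = range(0, T - n_fft + 1, hop)
    frames = np.stack([y[:, s:s + n_fft] * win for s in starts])  # (F, M, n_fft)
    return np.fft.rfft(frames, axis=-1).transpose(0, 2, 1)        # (F, K, M)

y = np.random.default_rng(5).standard_normal((3, 4096))
Y = stft_bins(y)   # one (frames, M) slice per frequency bin Y[:, k, :]
```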

The apparatus may further comprise the plurality of microphones configured to capture a multi-channel sound signal, wherein the processing circuitry is further configured to perform temporal sampling on the multi-channel sound signal to generate the observed signal. A dedicated sampling unit may be provided that is configured to perform temporal sampling on the multi-channel sound signal to generate the observed signal. The apparatus may further comprise a camera configured and arranged to record images of a room corresponding to the room response. More than one camera may be provided to record images of the entire reverberant environment.

According to another aspect of the present disclosure, a method for determining a room response of a desired source in a reverberant environment from an observed signal received by a plurality of microphones is provided wherein the method comprises: determining a direct-path steering vector for the plurality of microphones based on a known direction and/or location of the desired source; determining a reverberant-path steering vector for the plurality of microphones that maximizes a cross moment of a direct-path response to the observed signal and a reverberant-path response to the observed signal; and determining the room response based on a combination of the direct-path steering vector and the reverberant-path steering vector.

According to a further aspect, the room response may be determined based on the combination of the direct-path steering vector and the reverberant-path steering vector after relative scaling of the direct-path steering vector and the reverberant-path steering vector.

The relative scaling factor may be based on a ratio of standard deviations of the direct-path response to the observed signal and the reverberant-path response to the observed signal.

The cross moment may be a second order central cross moment of the direct-path response to the observed signal and the reverberant-path response to the observed signal.

According to a further aspect, the reverberant-path steering vector may be orthogonal to the direct-path steering vector.

According to a further aspect, the method may further comprise determining the direction and/or the location of the desired source with respect to the plurality of microphones based on source detection using image processing of images of the room recorded by a camera. Images recorded by more than one camera may be processed to detect the source.

The method may further comprise determining the direction and/or the location of the desired source with respect to the plurality of microphones based on direct beamforming using the plurality of microphones.

According to a further aspect, the room response may be determined for each frequency bin of a time-frequency transform of the observed signal.

The method may further comprise separating a source signal of the desired source using the determined room response. The method may in particular comprise maximizing a signal-to-interference-plus-noise ratio (SINR) of the separated source signal over a plurality of time intervals of the observed signal.

Furthermore, a computer-readable medium storing instructions that when executed on a processor cause the processor to perform any one of the above described methods is provided.

Any one of the direct-path beamforming unit, the reverberant-path beamforming unit, the combiner, the scaling unit, the image processing unit, the delay-sum beamforming unit, the direct-path signal estimator, the reverberant-path signal estimator, the signal combiner, the signal-to-interference-plus-noise ratio (SINR) unit, the Discrete Fourier transform (DFT) unit, and the sampling unit may be implemented as a software module or as a separate unit of the processing circuitry. The units may be implemented as a combination of software and hardware. The described processing may be performed by a chip, such as a general-purpose processor, a CPU, a GPU, a digital signal processor (DSP), or a field programmable gate array (FPGA), or the like. However, the present disclosure is not limited to implementation of the above-mentioned units on programmable hardware. One or more of the units may also be implemented on an application-specific integrated circuit (ASIC) or by a combination of the above-mentioned hardware components.

BRIEF DESCRIPTION OF DRAWINGS

In the following, exemplary embodiments are described in more detail with reference to the attached figures and drawings, in which:

Figure 1 shows a desired source and interfering sources in a reverberant environment to demonstrate the underlying problem solved by the present disclosure.

Figure 2 schematically shows details of a generalized beamformer according to a first embodiment of the present disclosure.

Figure 3 schematically shows details of a generalized beamformer according to a second embodiment of the present disclosure.

DETAILED DESCRIPTION

The present disclosure relates to the general technical field of acoustic processing. It offers a way to separate a desired source from interfering sources and noise in a reverberant environment. The reverberant-path room response for the desired source is included to achieve a higher signal-to-interference-plus-noise ratio as compared to conventional direct-path beamformers.

A reverberant environment may be a closed room, such as the cabin of a vehicle or a room in a building, but may also be an open space containing at least some objects that reflect the acoustic signal emitted by the desired source. Consequently, the methods and apparatuses of the present disclosure described below can be advantageously applied in a number of different scenarios. Typical applications include teleconferencing, augmented reality, in-vehicle communication systems, smartphones or tablets with two or more microphones, and the like.

The described methods and apparatuses may be used to better isolate a speech signal from a noisy and interfering environment wherein the speaker may be moving. The improved SINR of the isolated speech signal also helps to reduce errors in speech recognition processing and thus contributes to the comfort and safety of operating speech-operated devices, such as navigation systems, home theaters, and voice command devices that operate as front ends for web services, such as streaming audio, music, books, video and other digital content.

The room transfer function (RTF) or acoustic transfer function (ATF) describes the collective effect of multipath propagation of sound between a source, such as a speaker, and a receiver, such as an array of microphones, within a given acoustic room. Accurate modeling of the ATF is useful in sound field simulators as well as many other applications such as sound reproduction, sound field equalization, echo cancellation, and speech dereverberation.

When the spatial information is neglected, i.e. when both the source and the receivers are point-like and omnidirectional, all information about the ATF is contained in the impulse response of the corresponding acoustic room, in short the room impulse response (RIR), under the common hypothesis that the acoustics of a room form a linear, time-invariant system. Under the assumption that the ATF behaves as a linear time-invariant system, a finite impulse response (FIR) filter may be used to model the ATF.
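Under this LTI assumption, the FIR model of the ATF can be sketched in a few lines; the sampling rate, RIR length, and decay constant below are illustrative assumptions, not values from the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 0.25 s room impulse response at 16 kHz: a direct-path
# spike followed by an exponentially decaying reverberant tail.
fs = 16000
rir = np.zeros(fs // 4)
rir[0] = 1.0                                    # direct sound
tail = rng.standard_normal(len(rir) - 200)
rir[200:] = 0.3 * tail * np.exp(-np.arange(len(tail)) / 2000.0)

# Under the LTI assumption, the microphone signal is the source
# convolved with the FIR approximation of the ATF.
source = rng.standard_normal(fs)                # 1 s of source signal
mic = np.convolve(source, rir)[: len(source)]
```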

A typical approach for determining an ATF of an acoustic room is to measure the room response of a known signal with one or more microphones. The ATF is, however, extremely sensitive to the locations of the source of the signal and the one or more receivers, i.e. microphones. This is a direct result of the acoustic room being a reverberant environment wherein reflections of the emitted signal off walls or objects contribute to the room response. The acoustics of a reverberant environment includes both time-domain effects, such as echoes, discrete reflections, and a statistical reverberant tail, and frequency-domain effects, such as frequency response and frequency-dependent reverberation. The present disclosure deals with the problem of estimating an acoustic transfer function (ATF) with respect to a particular (desired) source in a reverberant environment, such as a closed room, for instance in a home, a vehicle's cabin, or any acoustic room with at least one object contributing to the reflection of sound.

The basic scenario is schematically shown in Figure 1 wherein the box schematically shows the delimitation of the acoustic room. A single desired source 110 is shown inside the acoustic room, emitting the acoustic source signal s_d. In addition, at least one stationary interfering source 112 emitting the source signal s_s and at least one nonstationary interfering source 114 emitting the source signal s_ns are present inside the acoustic room. The following description assumes that only the source signal of a single desired source shall be isolated while all other source signals are treated as noise. The present disclosure is, however, not limited to a single desired source but can easily be extended to isolate a plurality of desired source signals.

As schematically shown in Figure 1, acoustic waves emitted by the desired source 110 can reach the microphone array 100 along a direct path as well as along a reverberant path. Generally, a room impulse response obtained from a sound source at a specific position in a real environment can be divided into three parts: direct sound, early reflections, and late reflections. Without restriction, early and late reflections are combined in the following as a reverberant-path contribution to the observed signal y = [y_1 y_2 y_3]^T, wherein T indicates the transpose. According to the illustrative, nonlimiting example of Figure 1, an array of three microphones 100 is used to measure the room response as the observed signal y. Each microphone signal is subject to inherent sensor noise, such that the entire microphone array 100 adds sensor noise v = [v_1 v_2 v_3]^T to the observed signal.

The microphones may be any microphone known in the art, either omnidirectional or unidirectional. The plurality of microphones may be implemented as a microphone array, in particular, as a linear, a circular, or a spherical or hemispherical microphone array. The microphones may be distributed, in particular equidistantly, along a line, over a circle or across a surface to implement the plurality of microphones. Without restriction, the plurality of microphones is referred to in the present disclosure as a microphone array.

Each microphone is configured to capture an audio signal or microphone signal, wherein the microphone signal may be captured as an envelope signal or a digital signal. In the case of an envelope or analog signal, the audio signal or microphone signal may further be converted into a digital signal using an A/D converter. In the case of a plurality of microphones, the observed microphone signal is a multi-channel signal. For simplicity, the terms “audio signal” and “observed signal” are used in the following as general terms for digital signals captured by the plurality of microphones and processed by the processing circuitry, for instance by a processor unit of the processing circuitry. The processing will be described in detail below with respect to Figures 2 and 3.

Using the matrix-of-filters formalism, the observed signal y can be expressed with respect to the source signal s of the desired source by means of the acoustic transfer function A according to Equation (1):

y = A * s + n    (1)

wherein * denotes convolution, n summarizes the sensor noise and all interfering sources as noise, and the observed signal is represented as a discrete (digital) signal over a time frame or window, for instance as y = [y(0) ... y(T-1)]^T.

Equation (1) can be transformed from the time domain into the frequency domain, for instance using a fast Fourier transform (FFT), a discrete Fourier transform (DFT), or the like. In the frequency domain, Equation (1) simply reads:

y = A s + n    (2)

wherein, for the sake of simplicity, the same symbols are used for the transformed quantities.
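The step from the time domain to the frequency domain rests on the convolution theorem; a quick numerical check with an arbitrary short FIR transfer function (the lengths below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.standard_normal(64)      # toy source signal
a = rng.standard_normal(16)      # toy short FIR transfer function

# Convolution in time equals multiplication of DFTs, which is the
# step from Equation (1) to Equation (2); zero-pad to avoid wrap-around.
n_fft = 64 + 16 - 1
lhs = np.fft.rfft(np.convolve(s, a), n_fft)
rhs = np.fft.rfft(s, n_fft) * np.fft.rfft(a, n_fft)

print(np.allclose(lhs, rhs))     # True
```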

As shown in Figure 1, the observed signal y can generally be decomposed into contributions from the interfering sources, the desired source, and the sensor noise as in Equation (3):

y = H_d s_d + H_s s_s + H_ns s_ns + v    (3)

wherein H_d s_d, H_s s_s, and H_ns s_ns are the (relative) complete room responses of the different sources. By combining H_s s_s, H_ns s_ns, and v as the noise n, Equation (3) turns into Equation (2), wherein the acoustic transfer function A is given by the transfer function H_d for the desired source.

In a reverberant environment, the transfer function H_d can be decomposed into a direct-path contribution and a reverberant-path contribution as in Equation (4):

H_d = H_d,direct + H_d,reverberant    (4)
The fundamental problem underlying the present disclosure is to determine a time-invariant linear demixing system W = A^(-1) for the mixing system A, i.e. the acoustic transfer function with regard to the desired source 110 and the microphone array 100. The linear demixing system can then be used to isolate the (desired) response to the observed signal x^d from the entire observed signal.

As indicated above, the desired response to the observed signal can be decomposed into a direct-path component and a reverberant-path component as shown in Equation (5):

x^d = x^d_direct + x^d_reverberant    (5)

The present disclosure uses a modified generalized beamformer 150 to estimate the desired response to the observed signal x^d 190 using weighting vectors W^d that are applied to the entire observed signal according to Equation (6):

x^d = W^d y    (6)

wherein the weighting vectors are also decomposed into a direct-path component and a reverberant-path component according to Equation (7):

W^d = W^d_direct + W^d_reverberant    (7)

The desired response to the observed signal 190 may in particular be determined for a particular frame number m and frequency bin index k.

For simplicity, in the following, the superscript d for indicating the desired source-related quantities will be omitted and the direct-path and the reverberant-path components of the acoustic transfer function will simply be denoted as a_d and a_r, respectively. Equation (2) can then be written as Equation (8): y = (a_d + a_r)x + n (8)

The below described approach for finding the room response and the corresponding ATF is based on the assumption that the direction or the fully-specified spatial location of the desired source, i.e. the target source, is known. This is a reasonable assumption in many practical situations as will be described below with respect to Figures 2 and 3.

From the knowledge of the direction or full location of the target source, the direct-path relative transfer function a d can be computed. For a known source location, the free-space Green’s function can be used as for instance described in H. Kuttruff, Room acoustics, CRC Press, 2009. If only the direction toward the target source is known, a plane-wave (far field) response function may be used.

In the frequency domain, the direct-path relative transfer function a_d corresponds to a steering vector w_d for the direct path. Based on the knowledge of the steering vector for the direct path, a first estimate of the desired signal, i.e. the desired response to the observed signal, may be calculated. The method and apparatus described in the present disclosure improve this first estimate by adding a reverberant-path component.

A first embodiment of a generalized beamformer according to the present disclosure is shown in a schematic way in Figure 2. A direct-path beamforming unit 130 is provided to determine a direct-path steering vector w_d and a reverberant-path beamforming unit 140 is provided to determine a reverberant-path steering vector w_r. Based on a combination of the direct-path steering vector and the reverberant-path steering vector, the room response of the desired source is determined.

The present disclosure directly recovers the unknown reverberant portion of the acoustic transfer function based purely on prior knowledge of the direct part of the ATF. As discussed above, the direct-path component of the acoustic transfer function can be derived from knowledge of the direction and/or the location of the desired source with respect to the plurality of microphones. The direction can be expressed as a direction-of-arrival (DOA) expressing the direction from the plurality of microphones toward the desired source. The desired source may in particular be a speaker, i.e. a person, or an audio source such as a loudspeaker. The DOA may be expressed as a planar angle if the plurality of microphones and the desired source are arranged in one plane or may generally be expressed as a solid angle.

The method and apparatus according to the present disclosure require only knowledge of the direction toward the desired source for the determination of the direct-path component of the acoustic transfer function. Thus, the assumptions imposed by the present disclosure are weaker than those of classic beamforming approaches that, in addition to the direction toward the desired source, also assume that the distance between the plurality of microphones and the desired source is known.

The direction toward the desired source, such as the DOA, can for instance, be derived through image processing of one or several images of the reverberant environment including the desired source using one or several cameras 120. The cameras 120 may record individual images or a video consisting of a sequence of image frames. Pattern and/or face recognition techniques known in the art may be used to identify a potential speaker, i.e. a person, and/or an audio source such as a loudspeaker in the recorded images. Based on a sequence of images, it may further be determined whether the detected person is actually speaking. Based on a known location of the one or more cameras with respect to the acoustic room, the direction toward the speaker may be determined from the image data. In case a stereoscopic camera is provided, the distance to the desired source may be additionally determined by analyzing images from the stereoscopic camera as generally known in the art.

To determine the direction toward the desired source and/or the location of the desired source, in particular the distance to the desired source, a dedicated image processing unit 125 may be provided to perform the above described image processing. To this end, the image processing unit 125 receives images or image frames of a video from the camera 120, processes the received data to determine at least the direction toward the desired source, and provides the determined direction, and optionally location, to the direct-path beamforming unit 130. Although a camera 120 is presented in Figure 2 for recording the reverberant environment including the source object, one or several sensors that perform ranging of the acoustic room, such as a radar sensor or an infrared sensor, may be used to determine at least the direction toward the desired source. By way of example, an infrared sensor may be used to determine the relative location or locations of one or several human beings with respect to the microphone array. The sensors may operate based on individual rays or be provided to record continuous scanning images of the environment. Both the camera or cameras 120 and the ranging sensors may be located and oriented inside the acoustic room so as to capture at least those regions of the room where a desired source may be present. Ideally, the camera 120 and/or the one or more ranging sensors cover the entire area of the acoustic room from which the microphone array can capture sound waves.

As described above, the direct-path component a_d of the acoustic transfer function between the desired source and the microphone array can be determined based on the knowledge of the direction toward the desired source. To determine the direct-path response to the observed signal, a direct-path steering vector is determined by the direct-path beamforming unit 130 according to Equation (9):

w_d = a_d / (a_d^H a_d) (9)

wherein ^H denotes the conjugate transpose. The direct-path steering vector is thus normalized.

As described above, the sound field generated by the sources including the desired source is recorded by the microphone array 100 as shown in Figure 2. The recorded multi-channel signal is sampled using a sampling unit 104 and converted into a digital signal, e.g., using an A/D converter. Consecutive, potentially overlapping time frames or windows of the sampled multi-channel signal are then transformed from the time domain into the frequency domain, for instance using a DFT unit 106. The present disclosure is, however, not limited to using a Discrete Fourier transform but may be implemented using other transforms known in the art. Generally, it is assumed that the length of the window or time frame is much larger than the length of the acoustic transfer function such that Equation (2) applies. In addition to the sampling unit 104 and the DFT unit 106, the generalized beamformer according to the present disclosure may include further components as known in the art, such as one or several filtering units for filtering the captured multi-channel audio signals, for instance a high-pass filter to block signal parts that are typically heavily overlaid by noise and do not contain parts of the desired speech signal, and/or a low-pass filter to block signal parts outside a typical speech spectrum if the desired source is a human speaker. Furthermore, one or several amplifiers for amplifying captured audio signals may be provided. Further components as known in the art may be provided as part of the generalized beamformer.
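The framing and transform stage described above can be sketched as follows. The frame length, hop size and Hann window are illustrative choices for the example, not values prescribed by the disclosure:

```python
import numpy as np

def stft_frames(x, frame_len=512, hop=256):
    """Transform a multi-channel signal x of shape (n_mics, n_samples)
    into overlapping, windowed frequency-domain frames, returning Y of
    shape (n_mics, n_frames, frame_len // 2 + 1). Frame length and hop
    are illustrative; any suitable transform may replace the DFT."""
    n_mics, n_samples = x.shape
    n_frames = 1 + (n_samples - frame_len) // hop
    win = np.hanning(frame_len)
    Y = np.empty((n_mics, n_frames, frame_len // 2 + 1), dtype=complex)
    for m in range(n_frames):
        seg = x[:, m * hop : m * hop + frame_len] * win   # windowed time frame m
        Y[:, m, :] = np.fft.rfft(seg, axis=-1)            # per-channel DFT
    return Y

x = np.random.randn(4, 16000)   # 4 mics, 1 s at 16 kHz (synthetic stand-in)
Y = stft_frames(x)
```

Each frequency bin k of each frame m then provides one instance of the narrowband model y = A s + n of Equation (2).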

Based on the direct-path steering vector w_d, the direct-path response to the observed signal x_d is estimated according to Equation (10) by means of the direct-path signal estimator 160:

x_d = w_d^H y (10)

This linear estimate of the direct-path response to the observed signal therefore corresponds to performing delay-sum beamforming for the microphone array with the direct-path steering vector as the delay filter.

With regard to the multi-channel observed signal, the direct-path steering vector w_d derived from the direction toward the desired source specifies a subspace of the microphone space. The direct-path signal component, of which the direct-path response to the observed signal is a linear estimate, lives in this subspace of the microphone space. For the corresponding delay-sum beamformer, this subspace is also the steering vector subspace. In other words, the direct-path vector signal component is the projection of the full-room vector signal of the desired source onto the direct-path steering vector. Under the assumption that no point interferers are present with a non-zero component in the direction of the direct-path steering vector, the linear estimate x_d determined by projecting the entire observed signal y into the subspace of the direct-path steering vector w_d gives a good estimate for the direct-path vector signal component of the desired source.

The objective for separating a source signal of the desired source is to find the full-room vector signal of that desired source. The full-room vector signal, however, corresponds to the sum of the direct-path vector signal and the reverberant-path vector signal. As the direct-path steering vector is known, finding the full-room steering vector is equivalent to finding the reverberant-path steering vector, that is, finding the reverberant-path component of the acoustic transfer function.

To recover the unknown reverberant-path component of the acoustic transfer function, the acoustic transfer function is partitioned as shown in Equation (11): a = (1 + α)a_d + ã_r (11) wherein the scalar α = ⟨a_d, a_r⟩ is determined from the inner product of the direct-path component a_d and the reverberant-path component a_r of the acoustic transfer function and ã_r is accordingly set to ã_r = a_r − α a_d. Consequently, a_d and ã_r are orthogonal. In a noiseless case, the delay-sum beamformer aimed at the subspace of ã_r should produce an estimate of the reverberant-path response to the observed signal x_r that is highly correlated with the estimate of the direct-path response to the observed signal x_d.
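The partition of Equation (11) can be checked numerically. The vectors below are random stand-ins for the true transfer-function components, and a_d is assumed unit-norm (as it is for a normalized steering direction) so that the orthogonalization works out:

```python
import numpy as np

rng = np.random.default_rng(0)
a_d = rng.standard_normal(4) + 1j * rng.standard_normal(4)
a_d /= np.linalg.norm(a_d)      # assume a unit-norm direct-path component
a_r = rng.standard_normal(4) + 1j * rng.standard_normal(4)

alpha = np.vdot(a_d, a_r)       # inner product <a_d, a_r>
a_r_tilde = a_r - alpha * a_d   # modified reverberant-path component

# a_d and the modified component are orthogonal ...
assert abs(np.vdot(a_d, a_r_tilde)) < 1e-12
# ... and the partition (1 + alpha) a_d + a_r_tilde recovers a_d + a_r
assert np.allclose((1 + alpha) * a_d + a_r_tilde, a_d + a_r)
```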

Consequently, the following optimization problem to find this delay-sum beamformer, that is simply a scaled version of the modified reverberant-path component of the acoustic transfer function, can be defined:

max_{w_r} Re(E[w_d^H y y^H w_r]) (12)

s.t. Im(E[w_d^H y y^H w_r]) = 0 (13)

||w_r||_2 = 1 (14)

w_r^H w_d = 0 (15)

wherein Re(·) denotes the real part, Im(·) denotes the imaginary part, E[·] denotes the expectation, and ||·||_2 denotes the 2-norm. The expectation E[·] can be accumulated over time, e.g., the frame number m, and/or over the frequency bin index k.

According to Equation (12), the reverberant-path component w_r of the acoustic transfer function is determined by maximizing the real part of the expectation of the inner product ⟨x_d, x_r⟩ of the direct-path response to the observed signal x_d = w_d^H y and the reverberant-path response to the observed signal x_r = w_r^H y under the conditions that the expectation is purely real (Equation (13)), the reverberant-path component w_r is normalized (Equation (14)), and the reverberant-path component w_r is orthogonal to the direct-path component w_d (Equation (15)).

Equations (12) to (15) therefore determine the unit-norm reverberant-path steering vector that maximizes a cross moment of the direct-path response to the observed signal and the reverberant-path response to the observed signal. In the specific, nonlimiting case of Equations (12) to (15), the reverberant-path steering vector maximizes a second-order, mean-free cross moment between the direct-path response to the observed signal and the reverberant-path response to the observed signal. The present disclosure is, however, not limited to maximizing a mean-free cross moment but can alternatively maximize a second-order central cross moment of the direct-path response to the observed signal and the reverberant-path response to the observed signal. Thus, the reverberant-path steering vector may be determined by maximizing the covariance of the direct-path response to the observed signal and the reverberant-path response to the observed signal. Alternatively, the correlation between the direct-path response to the observed signal and the reverberant-path response to the observed signal may be maximized by dividing the covariance by the respective standard deviations σ_d and σ_r of the direct-path response and the reverberant-path response to the observed signal. Other cross moments of even order may be maximized as an alternative.

Determining the reverberant-path steering vector w_r according to Equations (12) to (15) by maximizing the covariance between the direct-path response to the observed signal obtained with the direct-path steering vector w_d and the unknown reverberant-path response to the observed signal makes use of the strong correlation between the direct-path room response and the reverberant-path room response of the desired source. In the exemplary embodiment of the generalized beamformer 150 according to Figure 2, the above described maximization is performed by a reverberant-path beamforming unit 140, wherein the reverberant-path response to the observed signal x_r is determined by a reverberant-path signal estimator 170. It is understood that the reverberant-path signal estimator may be incorporated into the reverberant-path beamforming unit 140.

Whilst the optimization problem in Equations (12) to (15) is non-convex, it does have a simple analytical solution. In particular, it can be shown that

w_r = (I − w_d w_d^H) P_y w_d / ||(I − w_d w_d^H) P_y w_d||_2 (16)

gives a solution to the optimization problem.

Here, I denotes the identity matrix and P_y denotes the covariance matrix of the observed signal y according to Equation (17):

P_y = (a_d + a_r) P_x (a_d + a_r)^H + P_n (17)

wherein it is assumed that the target signal of the desired source is independent of the noise.
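A minimal numerical check of the closed-form solution of Equation (16) can be sketched as follows. The covariance matrix P_y here is a synthetic Hermitian positive definite stand-in rather than one estimated from microphone data:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4                                           # number of microphones (illustrative)
w_d = rng.standard_normal(M) + 1j * rng.standard_normal(M)
w_d /= np.linalg.norm(w_d)                      # unit-norm direct-path steering vector

# Synthetic observed-signal covariance P_y (Hermitian positive definite)
Z = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
P_y = Z @ Z.conj().T + np.eye(M)

# Equation (16): project P_y w_d onto the complement of w_d, then normalize
proj = (np.eye(M) - np.outer(w_d, w_d.conj())) @ (P_y @ w_d)
w_r = proj / np.linalg.norm(proj)
```

By construction, w_r satisfies the unit-norm constraint (14) and the orthogonality constraint (15).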

As in the case of the blind source separation algorithm, Equation (16) produces a normalized steering vector w_r that estimates the reverberant-path component of the true acoustic transfer function of the desired source. To create an accurate estimate of the target subspace, this reverberant-path component is scaled in a second step.

By again exploiting the known direct-path component of the acoustic transfer function, the relative gain of the direct-path steering vector and the reverberant-path steering vector can be determined. The proper relative gain of the direct-path component and the reverberant-path component of the acoustic transfer function can be determined from the root-mean-square power of the corresponding direct-path response to the observed signal x_d and the reverberant-path response to the observed signal x_r. In other words, the standard deviations σ_d and σ_r of the direct-path response and the reverberant-path response to the observed signal, respectively, are determined to estimate the ratio of the direct-path and reverberant-path acoustic transfer functions of the desired source as shown in Equations (18) and (19):

σ_d = √(E[|x_d|²]) (18)

σ_r = √(E[|x_r|²]) (19)

The standard deviations according to Equations (18) and (19) can be determined by the direct-path signal estimator 160 and the reverberant-path signal estimator 170 according to Figure 2, respectively. A scaling unit (not shown) can then apply the determined gain as a relative scaling to the direct-path steering vector and/or the reverberant-path steering vector before combining the direct-path steering vector and the reverberant-path steering vector to determine the room response.

In particular, an appropriately scaled estimate of the true acoustic transfer function w_est for the desired source can be expressed as in Equation (20):

w_est = σ_d w_d + σ_r w_r (20)

A signal combiner 180 thus determines the room response by estimating the true acoustic transfer function w est from a scaled combination of the direct-path component of the acoustic transfer function and the reverberant-path component of the acoustic transfer function.

As mentioned above, the acoustic transfer function for the desired source with respect to the microphone array sensitively depends on the location of the desired source. Therefore, if the desired source is a human speaker, movement of the speaker with respect to the microphone array necessitates recalculation of the acoustic transfer function w est . The described method may therefore determine the acoustic transfer function w est for each speech frame, i.e. time frame that contains a speech signal. This may also be done while the speaker remains stationary. The resulting estimates of the acoustic transfer function w est may then be used by a signal-to-interference-plus-noise ratio (SINR) unit 185 to maximize the SINR of the source signal of the desired source over a plurality of time intervals of the observed signal.

Based on a combination of the direct-path response to the observed signal x_d determined by the direct-path signal estimator 160 and the reverberant-path response to the observed signal x_r determined by the reverberant-path signal estimator 170, the signal combiner 180 may separate a source signal x_d of the desired source by projecting the observed signal y into the subspace of the estimated acoustic transfer function w_est. Consequently, the source signal of the desired source may be isolated as shown in Equation (21):

x_d = w_est^H y (21)

The method of isolating a source signal according to Equation (21) may be seen as a filter-sum beamformer based on a scaled combination of the direct-path steering vector and the reverberant-path steering vector described above.
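The scaling and filter-sum extraction steps can be sketched end to end on synthetic data. The steering vectors and frames below are toy stand-ins, and the scaled combination is written here as σ_d w_d + σ_r w_r, which matches the relative-gain description in the text but is an assumption about the exact form of Equation (20):

```python
import numpy as np

def isolate_source(Y, w_d, w_r):
    """Filter-sum extraction sketch following Equations (18) to (21):
    estimate the per-path responses, weight the steering vectors by the
    RMS powers of those responses, and project the observed signal onto
    the combined vector. Y has shape (n_mics, n_frames); w_d and w_r
    are assumed unit-norm."""
    x_d = w_d.conj() @ Y                           # direct-path response, Eq. (10)
    x_r = w_r.conj() @ Y                           # reverberant-path response
    sigma_d = np.sqrt(np.mean(np.abs(x_d) ** 2))   # Eq. (18)
    sigma_r = np.sqrt(np.mean(np.abs(x_r) ** 2))   # Eq. (19)
    w_est = sigma_d * w_d + sigma_r * w_r          # scaled combination
    return np.conj(w_est) @ Y                      # isolated source signal, Eq. (21)

rng = np.random.default_rng(2)
M, T = 4, 100
w_d = np.ones(M) / np.sqrt(M)                      # toy unit-norm steering vectors
w_r = np.array([1.0, -1.0, 0.0, 0.0]) / np.sqrt(2)
Y = rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))
x_hat = isolate_source(Y, w_d, w_r)
```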

In the case of uncorrelated noise with a scaled identity covariance matrix, the above described method alone provides an effective means of estimating the acoustic transfer function of the desired source. In addition, the estimate of the acoustic transfer function may be used as a preconditioning step to improve the performance of a blind source separation algorithm by replacing the original estimate of the acoustic transfer function, i.e., the direct-path acoustic transfer function, with the estimate w_est.

An alternative embodiment of the generalized beamformer 150 of the present disclosure is schematically shown in Figure 3. According to this embodiment, the source location and/or direction are determined using a delay-sum beamforming unit 220 instead of the camera 120. A direction-of-arrival (DOA) angle with respect to the microphone array may be determined as a function of the delay of a speech signal between the individual audio signals captured by the microphones of the microphone array. Such a delay may, for instance, be computed by a cross-correlation of the different microphone signals. Other techniques as known in the art for determining the DOA based on a multi-channel microphone signal may be applied to determine at least the direction toward the desired source by the delay-sum beamforming unit 220. The determined direction and/or source location are then provided as input to the direct-path beamforming unit 130.
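The delay-based DOA estimation described above can be sketched for a two-microphone pair. The sampling rate, spacing and far-field geometry are illustrative assumptions; practical systems would typically use generalized cross-correlation (e.g. GCC-PHAT) and sub-sample interpolation rather than this bare cross-correlation:

```python
import numpy as np

def estimate_doa(x1, x2, fs, spacing, c=343.0):
    """Estimate a broadside-referenced DOA from the inter-microphone
    delay that maximizes the cross-correlation of two signals.
    Illustrative sketch only, restricted to integer-sample delays."""
    corr = np.correlate(x1, x2, mode="full")
    lag = np.argmax(corr) - (len(x2) - 1)     # delay of x1 relative to x2 [samples]
    tau = lag / fs                            # delay in seconds
    # far-field geometry: sin(theta) = tau * c / spacing, clipped to a valid range
    return np.degrees(np.arcsin(np.clip(tau * c / spacing, -1.0, 1.0)))

fs, spacing = 16000, 0.1
rng = np.random.default_rng(3)
s = rng.standard_normal(4096)
x2 = s
x1 = np.roll(s, 3)                            # x1 lags x2 by 3 samples (synthetic)
doa = estimate_doa(x1, x2, fs, spacing)
```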

Before determining the DOA of a human speaker, the delay-sum beamforming unit 220 may perform voice activity detection on the observed signal to determine whether the observed signal contains a speech signal.

Voice activity detection may be carried out based on measures determined from the observed signal, wherein the different measures include spectral slope, correlation coefficients, log likelihood ratios, cepstral and weighted cepstral coefficients, which are determined from the Fourier coefficients of the logarithm of the spectral density, as well as modified distance measures, short-time energy, zero-crossing rate, linear prediction coefficients, spectral entropy, a least-square periodicity measure, and wavelet transform coefficients. Voice activity detection may further include a noise reduction stage, e.g. by a spectral subtraction, filtering for echo compensation and/or determining signal coherence of two or more audio signals captured by spaced-apart microphones in order to filter out diffuse background noise and/or sound reflections.
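A toy frame-wise voice activity detector using two of the measures listed above, short-time energy and zero-crossing rate, can be sketched as follows. The thresholds and frame length are illustrative assumptions; a practical detector would tune or adapt them and combine more of the listed measures:

```python
import numpy as np

def simple_vad(x, frame_len=400, energy_thresh=0.01, zcr_thresh=0.25):
    """Toy frame-wise VAD: a frame is flagged as speech-like if its
    short-time energy is high enough and its zero-crossing rate is
    moderate. Thresholds are illustrative, not tuned values."""
    n_frames = len(x) // frame_len
    flags = []
    for m in range(n_frames):
        frame = x[m * frame_len:(m + 1) * frame_len]
        energy = np.mean(frame ** 2)                          # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0  # zero-crossing rate
        flags.append(energy > energy_thresh and zcr < zcr_thresh)
    return np.array(flags)

fs = 16000
t = np.arange(fs) / fs
x = np.concatenate([0.001 * np.random.randn(fs),     # low-level noise segment
                    np.sin(2 * np.pi * 200 * t)])    # tonal "voiced" segment
flags = simple_vad(x)
```

On this synthetic input, the low-energy noise frames are rejected while the tonal segment is flagged as active.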

To prevent non-verbal human utterances, such as sneezing, coughing, whistling or the like, from being accidentally detected as a speech signal, the delay-sum beamforming unit may further perform speech recognition on the observed signal, including detecting phonemes, words, phrases and/or sentences in the microphone signal.

Speech recognition may be carried out according to any of the methods known in the art. In particular, a speech recognition method may be based on hidden Markov models using cepstral coefficients. The employed hidden Markov model may further involve context dependency for phonemes, cepstral normalisation to normalise for different speakers and/or recording conditions, vocal tract length normalisation (VTLN) for male/female normalisation, and/or maximum likelihood linear regression (MLLR) for more general speaker adaptation. Aside from using the coefficients alone, their temporal dynamics may be included using so-called delta and delta-delta coefficients. Alternatively, splicing and an LDA-based projection followed perhaps by heteroscedastic linear discriminant analysis (HLDA) may be used. A speech recognition system based on hidden Markov models may further be adapted using discriminative training techniques, such as maximum mutual information (MMI), minimum classification error (MCE) and minimum phone error (MPE). As an alternative to hidden Markov models, the speech recognition method may be based on dynamic time warping (DTW). Also, neural networks may be used for certain aspects of speech recognition such as phoneme classification, isolated word recognition and speaker adaptation. Further, neural networks may be used as a pre-processing step to the actual speech recognition step.
Other methods, which may be used for detecting a speech signal in the microphone signal using speech recognition include, but are not limited to, power spectral analysis (FFT), linear predictive analysis (LPC), wherein a specific speech sample at a current time can be approximated as a linear combination of past speech samples and wherein the predictor coefficients are transformed to cepstral coefficients, perceptual linear prediction (PLP), which is based on the short term spectrum of speech and uses several psycho-physically based transformations, mel scale cepstral analysis (MEL), wherein the spectrum is warped according to the MEL scale and wherein cepstral smoothing is used to smooth the modified power spectrum, relative spectra filtering (RASTA) to compensate for linear channel distortions, which can be used either in the log spectral or cepstral domains, and energy normalisation to compensate for variances in loudness, in the microphone recording as well as in the signal energy between different phoneme sounds. Finally, statistical language modelling may be used for speech recognition, wherein a statistical language model assigns probability distributions to words and sentences of a language. In statistical language modelling, at least one of the spoken words is recognised on the grounds of one or more recognised preceding words. An example for statistical language modelling may be given, for instance, by the well-known n-gram language modelling.

The speech detection may be carried out continuously on a frame-by-frame basis, or with a predetermined, possibly configurable, frame size and overlap, for instance once every 500 ms or every second. In addition, a processor unit of the delay-sum beamforming unit may be adapted to periodically or continuously check for a speech signal in the captured multi-channel signal.

Detecting a speech signal may further comprise detecting speech activity of at least two different human speakers using voice recognition based on speaker recognition methods. While speaker recognition generally denotes the art of identifying the person who is speaking by characteristics of their voices, so-called biometrics, voice recognition according to the present embodiment may be limited to detecting that speech signals of at least two different human speakers are comprised in the detected speech signal. This may be achieved by a spectral analysis of the speech signal and by identifying at least two different spectral characteristics of the speech signal, without comparing the detected spectral characteristics and/or voice biometrics to predetermined spectral characteristics and/or voice biometrics associated with a specific person.

The speech signals of the at least two different human speakers may be contained in the detected speech signal at the same time, i.e. when at least two different human speakers utter verbal sounds simultaneously, or may be contained in the detected speech signal in different, possibly consecutive and/or overlapping, time intervals of the detected speech signal, i.e. in the case of an actual conversation between the at least two human speakers. The speaker recognition or speaker differentiation may be carried out using frequency estimation, hidden Markov models, Gaussian mixture models, pattern matching algorithms, neural networks, matrix representation, vector quantization, decision trees, a sequence of covariance lags of the spectral density of the signal, an autoregressive moving-average (ARMA) model, a spectral analysis based on the pitch of the detected speech signal, a detection of formants in the spectrum, or any other spectral characteristics as known in the art.

In the context of the present disclosure, detecting speech signals of at least two different human speakers may be used to determine individual directions or DOAs with respect to these speakers to isolate source signals of more than one desired source from the observed signal. As the remaining units and processing steps of the generalized beamformer of Figure 3 correspond to the respective units and processing steps of the generalized beamformer of Figure 2, a repeated description is omitted to avoid obscuring the present disclosure. It shall, however, be understood that the variations and modifications described above with respect to the generalized beamformer of Figure 2 can also be applied to the generalized beamformer of Figure 3.

In addition, a generalized beamformer comprising the delay-sum beamforming unit 220 and one or more cameras 120 may be provided. In this case, the DOA determined by the delay-sum beamforming unit 220 based on a captured speech signal may be used to single out a human speaker from a group of detected potential speakers as provided by the image processing unit 125. The output of the delay-sum beamforming unit 220 may also be used to confirm or correct the output of the image processing unit 125 and vice versa.

The described methods and apparatuses provide a generalized beamformer with the ability to obtain a gain in the signal-to-interference ratio that is higher than that of a conventional direct-path beamformer, with no additional information over that provided to a conventional direct-path beamformer. Thus, the generalized beamformer computes the relative room response using the same information that is available to a conventional direct-path beamformer. The described generalized beamformer is particularly effective in removing diffuse noise and sensor self-noise from the estimate of the target signal of the desired source.

If point interferers are present with a non-zero component in the direction of the direct-path steering vector, the performance of the described generalized beamformer will drop, as is the case for conventional direct-path beamforming. However, by optimizing over different time intervals and searching for maximum gain, it is possible to detect intermittent interferers and suppress their impact on the isolated source signal. The described method of generalized beamforming may be interpreted as a form of informed blind source separation.

Different from commonly known beamforming techniques that aim at speech dereverberation, the described generalized beamforming makes use of the significant signal power in the reverberant-path components of the target signal to improve the SINR. Furthermore, distinguishing between the direct-path and the reverberant-path signals for the beamforming calculation by first determining the steering vectors for the direct-path and reverberant-path components and then combining them provides a new way of combining the direct-path response and the reverberant-path response.

The improved signal-to-interference-plus-noise ratio of the isolated source signal further improves the results of the above described speech recognition techniques and correspondingly reduces the error-rate of operating devices such as navigation systems, voice command devices, or home entertainment systems based on speech.

The above described generalized beamforming may be performed and implemented by means of processing circuitry. The processing circuitry may comprise hardware (e.g., one or more processing units and a non-volatile memory) and software (e.g., program code stored in the memory, for execution by the one or more processing units). The processor units may include dedicated hardware components such as a DSP, an FFT unit, a filtering unit, a beamforming unit, and further units to perform the above described processing of the audio signals. Alternatively or additionally, the processor units may be configured to execute instructions stored on the memory for performing the operations described above. The generalized beamformer may in particular include processor-readable instructions encoded on a memory directing the processor unit to perform the above described methods.

The present disclosure further provides a non-transitory computer-readable medium storing instructions that, when performed by a processor, cause the processor to perform a method according to any of the above-described embodiments.

Instructions or software to control a processor to perform the methods as described above may be written as computer programs, code segments, instructions or any combination thereof, for individually or collectively instructing or configuring the processor to operate as a machine or special-purpose computer to perform the operations of the methods as described above. In one example, the instructions or software may include machine code that is directly executed by the processor, such as machine code produced by a compiler. In another example, the instructions or software may include higher level code that is executed by the processor using an interpreter. Programmers of ordinary skill in the art can readily write the instructions or software based on the description of the methods provided herein.