

Title:
ESTIMATION OF SIGNAL FROM NOISY OBSERVATIONS
Document Type and Number:
WIPO Patent Application WO/2006/114100
Kind Code:
A1
Abstract:
The invention provides a method for estimation of a signal from noisy observations, the method including employing a statistical model of signal dependencies in frequency domain. Preferably the signal dependencies in frequency domain include inter-frequency correlation. Preferably the method provides the statistical model by decomposing the noisy observations into a source model and a filter model, and preferably the source model includes performing a multi-pulse coding. Preferred embodiments include performing a joint linear minimum mean squared error estimation of signal power spectra and phase spectra. The invention also provides a noise reduction method and a speech enhancement method utilizing the signal estimation method. In addition, the invention provides a device including a processor adapted to perform the defined methods. The device may be such as a hearing aid, a headset, a mobile phone or the like. The methods have low computational complexity and are thus suited for e.g. miniature equipment with limited signal processing power, since the computational complexity can be adjusted by tuning the number of spectral components to be included in the estimate of each component.

Inventors:
ANDERSEN SOEREN VANG (DK)
LI CHUNJIAN (DK)
Application Number:
PCT/DK2006/000220
Publication Date:
November 02, 2006
Filing Date:
April 26, 2006
Assignee:
UNIV AALBORG (DK)
ANDERSEN SOEREN VANG (DK)
LI CHUNJIAN (DK)
International Classes:
G10L21/02; G10L21/0208; G10L21/0264
Domestic Patent References:
WO 2000/057402 A1, 2000-09-28
Foreign References:
US 5,007,094 A, 1991-04-09
Other References:
CHUNJIAN LI ET AL: "Inter-frequency dependency in mmse speech enhancement", SIGNAL PROCESSING SYMPOSIUM, 2004. NORSIG 2004. PROCEEDINGS OF THE 6TH NORDIC ESPOO, FINLAND 9-11 JUNE 2004, PISCATAWAY, NJ, USA,IEEE, 9 June 2004 (2004-06-09), pages 200 - 203, XP010732574, ISBN: 951-22-7065-X
CHUNJIAN LI ET AL: "Integrating Kalman filtering and multi-pulse coding for speech enhancement with a non-stationary model of the speech signal", SIGNALS, SYSTEMS AND COMPUTERS, 2004. CONFERENCE RECORD OF THE THIRTY-EIGHTH ASILOMAR CONFERENCE ON PACIFIC GROVE, CA, USA NOV. 7-10, 2004, PISCATAWAY, NJ, USA,IEEE, 7 November 2004 (2004-11-07), pages 2300 - 2304, XP010781136, ISBN: 0-7803-8622-1
Attorney, Agent or Firm:
PLOUGMANN & VINGTOFT A/S (P.O. Box 831, Copenhagen Ø, DK)
Claims:
1. A method for estimation of a signal from noisy observations, the method including employing a statistical model of signal dependencies in frequency domain.

2. Method according to claim 1, including the step of providing the statistical model by decomposing the noisy observations into a source model and a filter model.

3. Method according to claim 2, wherein the source model includes performing a multi-pulse coding (MPLPC).

4. Method according to any of the preceding claims, wherein the signal dependencies in frequency domain include inter-frequency dependency.

5. Method according to claim 4, wherein the inter-frequency dependency includes inter-frequency correlation.

6. Method according to any of the preceding claims, wherein the method includes performing a linear minimum mean squared error estimation.

7. Method according to claim 6, wherein the method includes performing a joint linear minimum mean squared error estimation of signal power spectra and phase spectra.

8. A noise reduction method including performing the method according to any of the preceding claims and providing a noise suppressed signal based on an output therefrom.

9. Speech enhancement method including performing the noise reduction method according to claim 8 on speech present in noisy observations so as to enhance speech by suppressing noise.

10. Device including a processor adapted to perform the method according to any of the preceding claims.

11. Device according to claim 10, the device being selected from the group consisting of: a mobile phone, a radio communication device, an internet telephony system, sound recording equipment, sound processing equipment, sound editing equipment, broadcasting sound equipment, and a monitoring system.

12. Device according to claim 10, the device being selected from the group consisting of: a hearing aid, a headset, an assistive listening device, an electronic hearing protector, and a headphone with a built-in microphone.

13. Computer executable program code adapted to perform the method according to any of claims 1-9.
Description:
ESTIMATION OF SIGNAL FROM NOISY OBSERVATIONS

Field of the invention

The invention relates to the field of signal processing, more specifically to processing aiming at estimating a signal from noisy observations, e.g. with the aim of noise reduction or enhancing speech contained in a noisy signal. The invention provides a method and a device, e.g. a headset, adapted to perform the method.

Background of the invention

Noise reducing methods, i.e. methods aiming at processing a noisy signal with the purpose of suppressing the noise, are important parts of e.g. modern hearing aids, headsets, mobile phones and the like. In such devices noise reduction techniques are used for speech enhancement, e.g. to improve speech intelligibility of speech contained in noise.

However, for applications within hearing aids and other miniature devices with limited signal processing power, a low complexity of the noise reduction algorithm is required so that a given amount of noise reduction can be performed in real time.

A large number of single channel noise reduction methods exist in the prior art. Earlier, spectral subtraction based methods were often used. However, in recent years two classes of methods have attracted attention since they provide superior performance compared to spectral subtraction methods: 1) frequency domain block minimum mean squared error (MMSE) based methods, and 2) signal subspace based methods. The MMSE based methods all rely on an assumption of quasi-stationarity and an assumption of uncorrelated spectral components in the signal, i.e. short time processing is required. Signal subspace based methods also assume stationarity within a short frame.

For example, for noise reduction of noisy signals containing speech, the short time processing is a disadvantage since voiced speech cannot be properly modelled, and thus existing methods do not offer optimal signal estimation for speech signals.

Summary of the invention

Thus, it may be seen as an object of the present invention to overcome the mentioned problem with prior art signal estimation methods that are not suited for modelling voiced parts of speech due to the requirement of short time processing.

In a first aspect, the invention provides a method for estimation of a signal from noisy observations, the method including employing a statistical model of signal dependencies in frequency domain.

Exploiting signal dependencies in frequency domain, especially correlation between spectral components, provides a method suited for voiced speech. This is due to the prominent temporal power localization in the excitation of voiced speech. Thus the method of the first aspect is suited for noise reduction in noisy speech signals or for speech enhancement. In addition, preferred embodiments provide a performance-to-complexity trade-off that makes them suited for resource-limited applications such as hearing aids, by tuning the number of spectral components to be included in the estimate of each component.

It is to be understood that "in frequency domain" also covers implementations of the method where the statistical model of signal dependencies is employed in the time domain, since it is well-known that there is a duality between time and frequency domain, and equivalent operations can be performed in either of these domains by proper transformation between the domains.

In preferred embodiments, the method includes the step of providing the statistical model by decomposing the noisy observations into a source model and a filter model, and preferably the source model includes performing a multi-pulse coding. This is advantageous for signal processing implementation purposes, since this operation results in a non-Toeplitz temporal signal covariance matrix or a non-diagonal spectral signal covariance matrix, in contrast to prior art quasi-stationary algorithms that result in a Toeplitz matrix.

Preferably the signal dependencies in frequency domain include inter-frequency dependency, most preferably inter-frequency correlation, phase structure, and/or non-stationarity of the signal.

Preferably the method includes performing a linear minimum mean squared error estimation, preferably including performing a joint linear minimum mean squared error estimation of signal power spectra and phase spectra.

In a second aspect, the invention provides a noise reduction method including performing the method according to the first aspect, and providing a noise suppressed signal based on an output therefrom.

Thus, the noise reduction method of the second aspect has the same advantages as mentioned for the first aspect, and it is understood that the preferred embodiments described for the first aspect apply for the second aspect as well.

The method is suited for a number of purposes where it is desired to reduce the noise of a noisy signal. In general, the method is suited to reduce noise by processing a noisy signal, i.e. an information signal corrupted by noise, and returning a noise suppressed signal. The signal may in general represent any type of data, e.g. audio data, image data, control signal data, data representing measured values etc. or any combination thereof. Due to its computational efficiency, the method is suited for on-line applications where limited signal processing power is available.

In a third aspect, the invention provides a speech enhancement method including performing the noise reduction method according to the second aspect on speech present in noisy observations so as to enhance speech by suppressing noise.

Thus, being based on the first and second aspects, the speech enhancement method of the third aspect has the same advantages as mentioned for the first and second aspects, and the preferred embodiments mentioned for the first aspect therefore also apply.

The speech enhancement method is suited for applications where an audio signal containing speech is corrupted by noise. The noise may be caused by electrical noise interfering with an electrical audio signal, or the noise may be acoustic noise introduced at the recording of the speech, e.g. a person speaking in a telephone at a place with traffic noise etc. The speech enhancement method can then be used to increase speech intelligibility by enhancing the speech in relation to the noise.

In a fourth aspect the invention provides a device including a processor adapted to perform the method of any one of the first, second or third aspects. Thus, the advantages and embodiments mentioned for the first, second and third aspects apply for the fourth aspect as well. Due to the computational efficiency of the proposed methods, the requirements on the signal processing power of the processor are relaxed.

Especially, the device may be: a mobile phone, a radio communication device, an internet telephony system, sound recording equipment, sound processing equipment, sound editing equipment, broadcasting sound equipment, or a monitoring system.

Alternatively, the device may be: a hearing aid, a headset, an assistive listening device, an electronic hearing protector, or a headphone with a built-in microphone (so as to allow sound from the environments to reach the listener).

In a fifth aspect, the invention provides computer executable program code adapted to perform the method according to any one of the first, second or third aspects. Thus, the same advantages as mentioned for these aspects apply.

The program code may be present on a program carrier, e.g. a memory card, a disk etc. or in a RAM or ROM memory of a device.

Brief description of the drawings

In the following the invention is described in more detail with reference to the accompanying figures, of which

Fig. 1 shows graphs illustrating multi-pulse linear prediction coding, and

Fig. 2 shows a block diagram of a preferred device.

While the invention is susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and will be described in detail herein. It should be understood, however, that the invention is not intended to be limited to the particular forms disclosed. Rather, the invention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.

Description of preferred embodiments

In the following, specific embodiments of the invention are illustrated in detail, referring to a preferred noise reduction/parametric speech enhancement algorithm based on a block based linear MMSE (LMMSE) noise reduction with a high temporal resolution modelling of the speech excitation. The algorithm aims at a joint LMMSE estimation of signal power spectra and phase spectra, as well as exploitation of correlation between spectral components. The major cause of this inter-frequency correlation is shown to be the prominent temporal power localization in the excitation of voiced speech. LMMSE estimators in time domain and frequency domain are first formulated. To obtain the joint estimator, the full spectral signal covariance matrix is used instead of a diagonal covariance matrix, as is the case in the Wiener filter derived under the quasi-stationary assumption. To accomplish this, the signal covariance matrix is decomposed into a source model (excitation matrix) and a filter model (synthesis filter matrix). The synthesis filter matrix is built from estimates of coefficients based on an all-pole model, and the excitation matrix is built from estimates of the instantaneous power of the excitation sequence. A decision-directed power spectral subtraction method and a modified multi-pulse linear predictive coding (MPLPC) method are used in these estimations, respectively.

The spectral domain formulation of the LMMSE estimator reveals important insight into inter-frequency correlations. This is exploited to significantly reduce computational complexity of the estimator, thus making the method suited for resource-limited applications such as hearing aids and the like. Performance-to-complexity trade-off can be conveniently adjusted by tuning the number of spectral components to be included in the estimate of each component. Experiments show that the proposed algorithm is able to reduce more noise than a number of other approaches selected from the state of the art. The proposed algorithm improves segmental signal to noise ratio of a noisy signal by 13 dB for a white noise case with an input of a 0 dB signal to noise ratio.

A description of the specific embodiments now follows, referring also to Fig. 1 illustrating multi-pulse coding.

Noise reduction has become an important function in hearing aids in recent years thanks to the application of powerful DSP hardware and progress in noise reduction algorithm design. Noise reduction algorithms with a high performance-to-complexity ratio have been the subject of extensive research for many years. Among many different approaches, two classes of single-channel speech enhancement methods have attracted significant attention in recent years because of their better performance compared to the classic spectral subtraction methods (a comprehensive study of spectral subtraction methods can be found in [6]). These two classes are the frequency domain block based Minimum Mean Squared Error (MMSE) approach and the signal subspace approach. The frequency domain MMSE approach includes the non-causal IIR Wiener filter [20], the MMSE Short-Time Spectral Amplitude (MMSE-STSA) estimator [10], the MMSE Log-Spectral Amplitude (MMSE-LSA) estimator [11], the Constrained Iterative Wiener Filtering (CIWF) [15], and the MMSE estimator using non-Gaussian priors [21]. These MMSE algorithms all rely on an assumption of quasi-stationarity and an assumption of uncorrelated spectral components in the signal. The quasi-stationarity assumption requires short time processing. At the same time, the assumption of uncorrelated spectral components can be warranted by assuming the signal to be infinitely long and wide-sense stationary [8][14]. This infinite data length assumption is in principle violated when using the short-time processing, although the effect of this violation may be minor (and is not the major issue addressed here). More importantly, the wide-sense stationarity assumption within a short frame does not model well the prominent temporal power localization in the excitation source of voiced speech due to the impulse train structure. This temporal power localization within a short frame can be modeled as a non-stationarity of the signal that is not resolved by the short-time processing. In [18], we show how voiced speech is advantageously modeled as non-stationary even within a short frame, and that this model implies significant inter-frequency correlations. As a consequence of the stationarity and long frame assumptions, the MMSE approaches model the frequency domain signal covariance matrix as a diagonal matrix.

Another class of speech enhancement methods, the signal subspace approach, implicitly exploits part of the inter-frequency correlation by allowing the frequency domain signal covariance matrix to be non-diagonal. This class includes the Time Domain Constraint (TDC) linear estimator and Spectral Domain Constraint (SDC) linear estimator [12], and the Truncated Singular Value Decomposition (TSVD) estimator [9]. In [12], the TDC estimator is shown to be an LMMSE estimator with adjustable input noise level. When the TDC filtering matrix is transformed to the frequency domain, it is in general non-diagonal. Nevertheless, the known signal subspace based methods still assume stationarity within a short frame. This can be seen as follows. In TDC and SDC the noisy signal covariance matrices are estimated by time averaging of the outer product of the signal vector, which requires stationarity within the interval of averaging. The TSVD method applies singular value decomposition to the signal matrix instead. This can be shown to be equivalent to the eigen decomposition of the time averaged outer product of signal vectors. Compared to the mentioned frequency domain MMSE approaches, the known signal subspace methods implicitly avoid the infinite data length assumption, so that the inter-frequency correlation caused by the finite length effect is accommodated. However, the more important cause of inter-frequency correlation, i.e. the non-stationarity within a frame, is not modeled.

In terms of exploiting the masking property of the human auditory system, the above mentioned frequency domain MMSE algorithms and signal subspace based algorithms can be seen as spectral masking methods without explicit modeling of masking thresholds. To see this, observe that the MMSE approaches shape the residual noise (the remaining background noise) power spectrum to one more similar to the speech power spectrum, thereby facilitating a certain degree of masking of the noise. In general, the MMSE approaches attenuate more in the spectral valleys than the spectral subtraction methods do. Perceptually, this is beneficial for high pitch voiced speech, which has sparsely located spectral peaks that are not able to mask the spectral valleys sufficiently. The signal subspace methods in [12] are designed to shape the residual noise power spectrum for a better spectral masking, where the masking threshold is found experimentally. Auditory masking techniques have received increasing attention in recent research on speech enhancement [2,26,29]. While the majority of these works focus on spectral domain masking, the work in [24] shows the importance of the temporal masking property in connection with the excitation source of voiced speech. It is shown that noise between the excitation impulses is more perceivable than noise close to the impulses, and this is especially so for low pitch speech, for which the excitation impulses are located sparsely in time. This temporal masking property is not employed by current frequency domain MMSE estimators and the signal subspace approaches.

In this paper, we develop an LMMSE estimator with a high temporal resolution modeling of the excitation of voiced speech, aiming at modeling a certain non-stationarity of the speech within a short frame, which is not modeled by quasi-stationarity based algorithms. The excitation of voiced speech exhibits prominent temporal power localization, which appears as an impulse train superimposed on a low level noise floor. We model this temporal power localization as a non-stationarity. This non-stationarity causes significant inter-frequency correlation. Our LMMSE estimator therefore avoids the assumption of uncorrelated spectral components, and is able to exploit the inter-frequency correlation. Both the frequency domain signal covariance matrix and the filtering matrix are estimated as complex-valued full matrices, which means that the information about inter-frequency correlation is not lost and the amplitude and phase spectra are estimated jointly. Specifically, we make use of the linear prediction based source-filter model to estimate the signal covariance matrix, upon which a time domain or frequency domain LMMSE estimator is built. In the estimation of the signal covariance matrix, this matrix is decomposed into a synthesis filter matrix and an excitation matrix. The synthesis filter matrix is estimated by a smoothed power spectral subtraction method followed by an autocorrelation Linear Predictive Coding (LPC) method. The excitation matrix is a diagonal matrix with the instantaneous power of the LPC residual as its diagonal elements. The instantaneous power of the LPC residual is estimated by a modified Multi-Pulse Linear Predictive Coding (MPLPC) method. Having estimated the signal covariance matrix, we use it in a vector LMMSE estimator. We show that by doing the LMMSE estimation in the frequency domain instead of in the time domain, the computational complexity can be reduced significantly due to the fact that the signal is less correlated in the frequency domain than in the time domain. Compared to several quasi-stationarity based estimators, the proposed LMMSE estimator results in a lower spectral distortion of the enhanced speech signal while having a higher noise reduction capability. The algorithm applies more attenuation in the valleys between pitch impulses in the time domain, while small attenuation is applied around the pitch impulses. This arrangement exploits the temporal masking effect, and results in a better preservation of abrupt rises of the waveform amplitude while maintaining a large amount of noise reduction.

The rest of this paper is organized as follows. In Section 0.2 the notations and assumptions used in the derivation of LMMSE estimators are outlined. In Section 0.3, the non-stationary modeling of the signal covariance matrices is described. The algorithm is summarized in Section 0.4. In Section 0.5, the computational complexity of the algorithm is reduced by identifying an interval of significant correlation and by simplifying the modified MPLPC procedure. Experimental settings and objective and subjective results are given in Section 0.6. Finally, Section 0.7 discusses the obtained results.

In this section, notations and statistical assumptions for the derivation of LMMSE estimators in time and frequency domain are outlined.

Time domain LMMSE estimator

Let y(n, k), s(n, k), v(n, k) denote the n'th sample of the noisy observation, the speech, and the additive noise (uncorrelated with the speech signal) of the k'th frame, respectively. Then y(n, k) = s(n, k) + v(n, k). Alternatively, in vector form we have

y = s + v,   (1)

where boldface letters represent vectors and the frame indices are omitted to allow a compact notation. For example, y = [y(1, k), y(2, k), ..., y(N, k)]^T is the noisy signal vector of the k'th frame, where N is the number of samples per frame.

To obtain linear MMSE estimators, we assume zero mean Gaussian PDF's for the noise and the speech processes. Under this statistical model the LMMSE estimate of the signal is the conditional mean [16]

\hat{s} = E[s|y] = C_s (C_s + C_v)^{-1} y,   (2)

where C_s and C_v are the covariance matrices of the signal and the noise, respectively. The covariance matrix is defined as C_s = E[ss^H], where (\cdot)^H denotes Hermitian transposition and E[\cdot] denotes the ensemble average operator.

Frequency domain LMMSE estimator and Wiener filter

In the frequency domain the goal is to estimate the complex DFT coefficients of the signal given the DFT coefficients of the noisy observation. Let Y(m, k), θ(m, k), and V(m, k) denote the m'th DFT coefficient of the k'th frame of the noisy observation, the signal, and the noise, respectively. Due to the linearity of the DFT operator, we have

Y(m, k) = θ(m, k) + V(m, k).   (3)

In vector form we have

Y = θ + V,   (4)

where again boldface letters represent vectors and the frame indices are omitted. As an example, the noisy spectrum vector of the k'th frame is arranged as Y = [Y(1, k), Y(2, k), ..., Y(N, k)]^T, where the number of frequency bins is equal to the number of samples per frame N.

We again use the linear model. Y, θ, and V are assumed to be zero-mean complex Gaussian random variables, and θ and V are assumed to be uncorrelated with each other.

The LMMSE estimate is the conditional mean

\hat{θ} = E[θ|Y] = C_θ (C_θ + C_V)^{-1} Y,   (5)

where C_θ and C_V are the covariance matrices of the DFT coefficients of the signal and the noise, respectively. By applying the inverse DFT to each side, (5) can easily be shown to be identical to (2).

The relation between the two signal covariance matrices in time and frequency domain is

C_θ = F C_s F^{-1},   (6)

where F is the Fourier matrix. If the frame were infinitely long and the signal stationary, C_s would be an infinitely large Toeplitz matrix. The infinite Fourier matrix is known to be the eigenvector matrix of any infinite Toeplitz matrix [14]. Thus, C_θ becomes diagonal and the LMMSE estimator (5) reduces to the non-causal IIR Wiener filter with the transfer function

H_{WF}(ω) = \frac{P_{ss}(ω)}{P_{ss}(ω) + P_{vv}(ω)},   (7)

where P_{ss}(ω) and P_{vv}(ω) denote the power spectral density (PSD) of the signal and the noise, respectively. In the sequel we refer to (7) as the Wiener filter or WF.
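Purely for illustration, a minimal numerical sketch (not part of the patent; the function names and the use of numpy are assumptions) contrasting the vector LMMSE estimator of eq. (5) with the diagonal Wiener gain of eq. (7):

    import numpy as np

    def lmmse_freq(Y, C_theta, C_V):
        # Vector LMMSE, eq. (5): theta_hat = C_theta (C_theta + C_V)^{-1} Y.
        return C_theta @ np.linalg.solve(C_theta + C_V, Y)

    def wiener_gain(P_ss, P_vv):
        # Diagonal Wiener filter, eq. (7): H(w) = P_ss / (P_ss + P_vv).
        return P_ss / (P_ss + P_vv)

When C_theta and C_V are diagonal with P_ss and P_vv on their diagonals, lmmse_freq reduces to an element-wise multiplication of Y by wiener_gain, which is exactly the reduction described above.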

High temporal resolution modeling for the signal covariance matrix estimation

For both the time and frequency domain LMMSE estimators described in Section 0.2, the estimation of the signal covariance matrix C_s is crucial. In this work, we assume the noise to be stationary. For the signal, however, we propose the use of a high temporal resolution model to capture the non-stationarity caused by the excitation power variation. This can be explained by examining the voice production mechanism. In the well known source-filter model for voiced speech, the excitation source models the glottal pulse train, and the filter models the resonance property of the vocal tract. The vocal tract can be viewed as a slowly varying part of the system; typically it changes very little over a duration of 20 to 30 ms. The vocal folds vibrate at a faster rate, producing periodic glottal flow pulses. Typically there can be 2 to 8 glottal pulses in 20 ms. In speech coding, it is common practice to model this pulse train by a long-term correlation pattern parameterized by a long-term predictor [4][3][5]. However, this model fails to describe the linear relationship between the phases of the harmonics. That is, the long term predictor alone does not model the temporal localization of power in the excitation source. Instead, we apply a time envelope that captures the localization and concentration of pitch pulse energy in the time domain. This, in turn, introduces an element of non-stationarity to our signal model because the excitation sequence is now modeled as a random sequence with time varying variance, i.e., the glottal pulses are modeled with higher variance and the rest of the excitation sequence is modeled with lower variance. This modeling of non-stationarity within a short frame implies a temporal resolution much finer than that of the quasi-stationarity based algorithms, which have a temporal resolution equal to the frame length. Thus we term the former the high temporal resolution model. It is worth noting that some unvoiced phonemes, such as plosives, have very fast changing waveform envelopes, which also could be modeled as non-stationarity within the analysis frame. In this paper, however, we focus on the non-stationary modeling of voiced speech.

Modeling signal covariance matrix

The signal covariance matrix is usually estimated by averaging the outer product of the signal vector over time. As an example, this is done in the signal subspace approach [12]. This method assumes ergodicity of the autocorrelation function within the averaging interval.

Here we propose the following method of estimating C_s with the ability to model a certain element of non-stationarity within a short frame. The following discussion is only appropriate for voiced speech. Let r denote the excitation source vector, and H denote the synthesis filtering matrix corresponding to the vocal tract filter, such as

H = \begin{bmatrix} h(0) & 0 & \cdots & 0 \\ h(1) & h(0) & & 0 \\ h(2) & h(1) & h(0) & \\ \vdots & & & \ddots \\ h(N-1) & h(N-2) & \cdots & h(0) \end{bmatrix},

where h(n) is the impulse response of the LPC synthesis filter.

We then have

s = Hr, (8)

and therefore

C_s = E[ss^H] = H C_r H^H,   (9)

where C_r is the covariance matrix of the model residual vector r. In (9) we treat H as a deterministic quantity. This simplification is common practice also when the LPC filter model is used to parameterize the power spectral density in classic Wiener filtering [19][15]. Section 0.3.2 addresses the estimation of H. Note that (8) does not take into account the zero-input response of the filter in the previous frame. Either the zero-input response can be subtracted prior to the estimation of each frame, or a windowed overlap-add procedure can be applied to eliminate this effect.

We now model r as a sequence of independent zero mean random variables. The covariance matrix C_r is therefore diagonal with the variance of each element of r as its diagonal elements. For voiced speech, except for the pitch impulses, the rest of the residual is of very low amplitude and can be modeled as constant variance random variables. Therefore, the diagonal of C_r takes the shape of a constant floor with a few periodically located impulses. We term this the temporal envelope of the instantaneous residual power. This temporal envelope is an important part of the new MMSE estimator because it provides the information of uneven temporal power distribution. In the following two subsections, we describe the estimation of the spectral envelope and the temporal envelope, respectively.

Estimating the spectral envelope

In the context of LPC analysis, the synthesis filter has a spectrum that is the envelope of the signal spectrum. Thus, our goal in this subsection is to estimate the spectral envelope of the signal. We first use the Decision Directed method [10] to estimate the signal power spectrum and then use the autocorrelation method to find the spectral envelope.

The noisy signal power spectrum of the k'th frame, |Y(k)|^2, is obtained by applying the DFT to the k'th observation vector y(k) and squaring the amplitudes. The Decision Directed estimate of the signal power spectrum of the k'th frame, |\hat{θ}(k)|^2, is a weighted sum of two parts: the power spectrum of the estimated signal of the previous frame, |\hat{θ}(k-1)|^2, and the power-spectrum-subtraction estimate of the current frame's power spectrum:

|\hat{θ}(k)|^2 = a\,|\hat{θ}(k-1)|^2 + (1 - a)\,\max(|Y(k)|^2 - E[|V(k)|^2],\, 0),   (10)

where a is a smoothing factor, a ∈ [0, 1], and E[|V(k)|^2] is the estimated noise power spectral density. The purpose of such a recursive scheme is to improve the estimate of the power spectrum subtraction method by smoothing out the random fluctuation in the noise power spectrum, thus reducing the "musical noise" artifact [7]. Other iterative schemes with similar time or spectral constraints are applicable in this context. For a comprehensive study of constrained iterative filtering techniques, readers are referred to [15]. We now take the square-root of the estimated power spectrum and combine it with the noisy phase to reconstruct the so called intermediate estimate, which has the noise-reduced amplitude spectrum but noisy phase. An autocorrelation method LPC analysis is then applied to this intermediate estimate to obtain the synthesis filter coefficients.
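A minimal sketch of eq. (10), assuming vectorized power spectra and the smoothing factor a = 0.98 used later in the experiments (all names are illustrative):

    import numpy as np

    def decision_directed(Y_power, prev_est_power, noise_psd, alpha=0.98):
        # |theta_hat(k)|^2 = alpha |theta_hat(k-1)|^2
        #                    + (1 - alpha) max(|Y(k)|^2 - E[|V(k)|^2], 0)
        subtracted = np.maximum(Y_power - noise_psd, 0.0)
        return alpha * prev_est_power + (1.0 - alpha) * subtracted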

Estimating the temporal envelope

We propose to use a modified MPLPC method to robustly estimate the temporal envelope of the residual power. MPLPC was first introduced by Atal and Remde [4] to optimally determine the impulse positions and amplitudes of the excitation in the context of analysis-by-synthesis linear predictive coding. The principle is to represent the LPC residual with a few impulses whose locations and amplitudes (gains) are chosen such that the difference between the target signal and the synthesized signal is minimized. In the noise reduction scenario, the target signal is the noisy signal and the synthesis filter must be estimated from the noisy signal. Here, the synthesis filter is treated as known. For the residual of voiced speech, there is usually one dominating impulse in each pitch period. We first determine one impulse per pitch period, then model the rest of the residual as a noise floor with constant variance. In MPLPC the impulses are found sequentially [17]. The first impulse location and amplitude are found by minimizing the distance between the synthesized signal and the target signal. The effect of this impulse is subtracted from the target signal and the same procedure is applied to find the next impulse. Because this way of finding impulses does not take into account the interaction between the impulses, re-optimization of the impulse amplitudes is necessary every time a new impulse is found. The number of pitch impulses p in a frame is determined in the following way. p is first assigned an initial value equal to the largest number of pitch periods possible in a frame. Then p impulses are determined using the above mentioned method. Only the impulses with an amplitude larger than a threshold are selected as pitch impulses. In our experiment, the threshold is set to 0.5 times the largest impulse amplitude in the frame. Having determined the impulses, a white noise sequence representing the noise floor of the excitation sequence is added into the gain optimization procedure together with all the impulses. We use a codebook of 1024 white Gaussian noise sequences in the optimization. The white noise sequence that yields the smallest synthesis error to the target signal is chosen as the estimate of the noise floor. This procedure is in fact a multi-stage coder with p impulse stages and one Gaussian codebook stage, with a joint re-optimization of gains. A detailed treatment of this optimization problem can be found in [22]. After the optimization, we use a flat envelope equal to the square of the gain of the selected noise sequence to model the variance of the noise floor. Finally, the temporal envelope of the instantaneous residual power is composed of the noise floor variance and the squared impulses. When applied to noisy signals, the MPLPC procedure can be interpreted as a non-linear least squares fitting to the noisy signal, with the impulse positions and amplitudes as the model parameters.
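A much-simplified sketch of the sequential impulse search (the joint gain re-optimization and the 1024-entry Gaussian codebook stage described above are omitted; all names are assumptions):

    import numpy as np

    def find_impulses(H, target, p):
        # Place p impulses one at a time; column j of the synthesis matrix H
        # is the impulse response shifted to position j.
        residual = target.astype(float).copy()
        positions, amplitudes = [], []
        for _ in range(p):
            num = H.T @ residual                     # correlation with each column
            den = np.sum(H * H, axis=0)              # energy of each column
            gains = num / np.maximum(den, 1e-12)
            reduction = gains * num                  # error reduction per position
            pos = int(np.argmax(reduction))
            positions.append(pos)
            amplitudes.append(gains[pos])
            residual -= gains[pos] * H[:, pos]       # subtract this impulse's effect
        return np.array(positions), np.array(amplitudes)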

The algorithm

Having obtained the estimate of the temporal envelope of the instantaneous residual power and the estimate of the synthesis filter matrix, we are able to build the signal covariance matrix in (9). The covariance matrix is used in the time domain LMMSE estimator (2) or in the spectral LMMSE estimator (5) after being transformed by (6).

The noise covariance matrix can be estimated using speech absent frames. Here, we assume the noise to be stationary. For the time domain LMMSE estimator (2), if the noise is white, the covariance matrix C_v is diagonal with the noise variance as its diagonal elements. In the case of colored noise, the noise covariance matrix is no longer diagonal and it can be estimated using the time averaged outer product of the noise vector. For the spectral domain LMMSE estimator (5), C_V is a diagonal matrix with the power spectral density of the noise as its diagonal elements. This is due to the assumed stationarity of the noise¹. In the special case where the noise is white, the diagonal elements all equal the variance of the noise.
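As an illustration (names and the scaling convention of the PSD estimate are assumptions), the two noise covariance estimates from speech-absent frames could be sketched as:

    import numpy as np

    def noise_covariances(noise_frames):
        # Time-domain C_v: time average of outer products (handles colored noise).
        C_v = np.mean([np.outer(v, v) for v in noise_frames], axis=0)
        # Spectral-domain C_V: diagonal, with the averaged noise periodogram
        # (a PSD estimate, up to a scaling convention) on the diagonal.
        psd = np.mean([np.abs(np.fft.fft(v)) ** 2 for v in noise_frames], axis=0)
        return C_v, np.diag(psd)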

We model the instantaneous power of the residual of unvoiced speech with a flat envelope. Here, voiced speech refers to phonemes that require excitation from the vocal fold vibration, and unvoiced speech consists of the rest of the phonemes. We use a simple voiced/unvoiced detector that utilizes the fact that voiced speech usually has most of its power concentrated in the low frequency band, while unvoiced speech has a relatively flat spectrum within 0 to 4 kHz. Every frame is low pass filtered and the filtered signal power is compared with the original signal power. If the power loss is more than a threshold, the frame is marked as an unvoiced frame, and vice versa. Note, however, that even for the unvoiced frames, the spectral covariance matrix is non-diagonal because the signal covariance matrix C_s, built in this way, is not Toeplitz. Hereafter, we refer to the proposed approach as the Time-Frequency-Envelope MMSE estimator (TFE-MMSE), due to its utilization of envelopes in both time and frequency domain. The algorithm is summarized in Algorithm 1.

Reducing computational complexity

The TFE-MMSE estimators require inversion of a full covariance matrix C_s or C_θ. This high computational load prohibits the algorithm from real time application in hearing aids. Noticing that both covariance matrices are symmetric and positive definite, Cholesky factorization can be applied to the covariance matrices, and the inversion can be done by inverting the Cholesky triangle. A careful implementation requires N^3/3 operations for the Cholesky factorization [13] and the algorithm complexity is O(N^3). Another computation intensive part of the algorithm is the modified MPLPC method. In this section we propose simplifications to these two parts.
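The Cholesky route can be sketched as follows (scipy names; illustrative only); the explicit inverse is never formed:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def lmmse_cholesky(C_s, C_v, y):
        # Factor C_s + C_v once (~N^3/3 operations), then solve for
        # (C_s + C_v)^{-1} y and multiply by C_s, as in eq. (2).
        factor = cho_factor(C_s + C_v)
        return C_s @ cho_solve(factor, y)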

Further reduction of complexity for the filtering requires understanding of the inter-frequency correlation.

¹ In modeling the spectral covariance matrix of the noise we have ignored the inter-frequency correlations caused by the finite-length window effect. With typical window lengths, e.g. 15 to 30 ms, the inter-frequency correlations caused by the window effect are less significant than those caused by the non-stationarity of the signal. This can easily be seen by examining a plot of the spectral covariance matrix.

Algorithm 1 TFE-MMSE estimator

1: Take the k'th frame.
2: Estimate the noise PSD from the latest speech-absent frame.
3: Calculate the power spectrum of the noisy signal.
4: Do power spectrum subtraction estimation of the signal PSD, and refine the estimate using Decision-Directed smoothing (eq. (10)).
5: Reconstruct the signal by combining the amplitude spectrum estimated in step 4 and the noisy phase.
6: Do LPC analysis on the reconstructed signal. Obtain the synthesis filter coefficients, and form the synthesis matrix H.
7: IF the frame is voiced: estimate the envelope of the instantaneous residual power using the modified MPLPC method.
8: IF the frame is unvoiced: use a constant envelope for the instantaneous residual power.
9: ENDIF
10: Calculate the residual covariance matrix C_r.
11: Form the signal covariance matrix C_s = H C_r H^H (eq. (9)).
12: IF time domain LMMSE: \hat{s} = C_s (C_s + C_v)^{-1} y (eq. (2)).
13: IF frequency domain LMMSE: transform C_s to the frequency domain, C_θ = F C_s F^{-1}, filter the noisy spectrum \hat{θ} = C_θ (C_θ + C_V)^{-1} Y (eq. (5)), and obtain the signal estimate by inverse DFT.
14: ENDIF
15: Calculate the power spectrum of the filtered signal, |\hat{θ}(k-1)|^2, for use in the PSD estimation of the next frame.
16: k = k + 1 and go to 1.
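An illustrative per-frame driver for Algorithm 1, reusing the sketches given earlier (decision_directed, synthesis_matrix, signal_covariance); lpc_autocorrelation and residual_power_envelope are hypothetical helpers standing in for steps 6-10, and the LPC order and alpha are assumptions:

    import numpy as np
    from scipy.linalg import toeplitz

    def tfe_mmse_frame(y, noise_psd, prev_est_power, lpc_order=10, alpha=0.98):
        N = len(y)
        Y = np.fft.fft(y)                                            # step 3
        est_power = decision_directed(np.abs(Y) ** 2, prev_est_power,
                                      noise_psd, alpha)              # step 4
        intermediate = np.fft.ifft(np.sqrt(est_power)
                                   * np.exp(1j * np.angle(Y))).real  # step 5
        a = lpc_autocorrelation(intermediate, lpc_order)             # step 6 (hypothetical)
        H = synthesis_matrix(a, N)
        env = residual_power_envelope(H, y)                          # steps 7-10 (hypothetical)
        C_s = signal_covariance(H, env)                              # step 11
        # Stationary-noise covariance: Toeplitz from the autocorrelation,
        # here taken as the IFFT of the PSD (up to a scaling convention).
        C_v = toeplitz(np.fft.ifft(noise_psd).real)
        s_hat = C_s @ np.linalg.solve(C_s + C_v, y)                  # step 12, eq. (2)
        return s_hat, np.abs(np.fft.fft(s_hat)) ** 2                 # step 15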

In the time domain the signal samples are clearly correlated with each other over a very long span. However, in the frequency domain, the correlation span is much smaller. This can be seen from the magnitude plots of the two covariance matrices (see Fig. 1). For the spectral covariance matrix, the significant values concentrate around the diagonal. This fact indicates that a small number of diagonals capture most of the inter-frequency correlation. The simplified procedure is as follows. Half of the spectrum vector θ is divided into small segments of l frequency bins each. The sub-vector starting at the j'th frequency is denoted θ_{sub_j}, where j ∈ {1, l, 2l, ..., N/2} and l ≪ N. The noisy signal spectrum and the noise spectrum can be segmented in the same way, giving Y_{sub_j} and V_{sub_j}. The LMMSE estimate of θ_{sub_j} needs only a block of the covariance matrix, which means that the estimate of a frequency component benefits from its correlations with l neighboring frequency components instead of all components. This can be written as

\hat{θ}_{sub_j} = C_{θ,sub_j} (C_{θ,sub_j} + C_{V,sub_j})^{-1} Y_{sub_j}.

The first half of the signal spectrum can be estimated segment by segment. The second half of the spectrum is simply a flipped and conjugated version of the first half. The segment length is chosen to be l = 8, which in our experience does not degrade performance noticeably compared with the use of the full matrix. Other segmentation schemes are applicable, such as overlapping segments. It is also possible to use a number of surrounding frequency components to estimate a single component at a time. We use the non-overlapping segmentation because it is computationally less expensive while maintaining good performance for small l. When the signal frame length is 128 samples and the block length is l = 8, the simplified method requires only a small fraction of the original complexity for the filtering part of the algorithm, at the extra expense of FFT operations on the covariance matrix. When l is set to values larger than 24, very little improvement in performance is observed. When l is set to values smaller than 8, the quality of the enhanced speech degrades noticeably. By tuning the parameter l, an effective trade-off between enhanced speech quality and computational complexity can be adjusted conveniently.
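A sketch of the segmented spectral filtering (edge and Nyquist-bin handling are simplified; names are assumptions):

    import numpy as np

    def lmmse_segmented(Y, C_theta, noise_psd, l=8):
        # Filter the first half of the spectrum in blocks of l bins; each
        # component then only uses its correlation with l-1 neighbours.
        N = len(Y)
        theta_hat = np.zeros(N, dtype=complex)
        for j in range(0, N // 2, l):
            sl = slice(j, j + l)
            C_b = C_theta[sl, sl]                        # l-by-l covariance block
            C_vb = np.diag(noise_psd[sl])
            theta_hat[sl] = C_b @ np.linalg.solve(C_b + C_vb, Y[sl])
        # Second half: flipped, conjugated copy of the first half (real signal);
        # the Nyquist bin is left untouched in this simplified sketch.
        theta_hat[N // 2 + 1:] = np.conj(theta_hat[1:N // 2][::-1])
        return theta_hat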

In the MPLPC part of the algorithm, the optimization of the impulse amplitudes and the gain of the noise floor brings a heavy computational load. It can be simplified by fixing the impulse shape and the noise floor level. In the simplified version, the MPLPC method is only used for searching the locations of the p dominating impulses. Once the locations are found, a predetermined pulse shape is placed at each location. An envelope of the noise floor is also predetermined. The pulse shape is chosen to be wider than an impulse in order to gain robustness against estimation errors of the impulse locations. This is helpful as long as noise is present. The pulse shape used in our experiment is a raised cosine waveform with a period of 18 samples, and the ratio between the pulse peak and the noise floor amplitude is experimentally determined to be 6.6. Finally, the estimated residual power must be normalized. Although the pulse shape and the relative level of the noise floor are fixed for all frames, experiments show that the TFE-MMSE estimator is not sensitive to this change. The performance of both the simplified procedure and the optimum procedure is evaluated in Section 0.6. Fig. 1 shows the estimated envelopes of the residual obtained in the two ways.
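A sketch of the simplified envelope with the fixed pulse shape (the 18-sample raised cosine and the 6.6 peak-to-floor ratio follow the text; the placement and normalization details are assumptions):

    import numpy as np

    def simplified_envelope(positions, N, width=18, peak_to_floor=6.6):
        # Raised-cosine pulse of the given width, peak amplitude 1.
        pulse = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(width) / width))
        env = np.full(N, 1.0 / peak_to_floor)        # constant noise floor
        for pos in positions:
            offset = pos - width // 2                # center the pulse on pos
            start, stop = max(0, offset), min(N, offset + width)
            seg = pulse[start - offset: stop - offset]
            env[start:stop] = np.maximum(env[start:stop], seg)
        env /= np.max(env)                           # normalize amplitudes
        return env ** 2                              # instantaneous power envelope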

Results

The objective performance of the TFE-MMSE estimator is first evaluated and compared with the Wiener filter [20], the MMSE-LSA estimator [11], and the signal subspace TDC estimator [12]. For the TFE-MMSE estimator, both the complete algorithm and the simplified algorithms are evaluated. For all estimators the sampling frequency is 8 kHz, and the frame length is 128 samples with 50% overlap. In the Wiener filter we use the same Decision Directed method as in the MMSE-LSA and the TFE-MMSE estimator to estimate the PSD of the signal. An important parameter for the Decision Directed method is the smoothing factor a. The larger a is, the more noise is removed and the more distortion is imposed on the signal, because of the heavier smoothing of the spectrum. In the MMSE-LSA estimator with the aforesaid parameter setting, we found experimentally a = 0.98 to be the best trade-off between noise reduction and signal distortion. We use the same a for the WF and the TFE-MMSE estimator as for the MMSE-LSA estimator. For the TDC, the parameter μ (μ ≥ 1) controls the degree of over-suppression of the noise power [12]. The larger μ is, the more the noise is attenuated but the larger the distortion of the speech. We choose μ = 3 in the experiments by balancing noise reduction and signal distortion.

All estimators are run on 32 sentences from different speakers (16 male and 16 female) from the TIMIT database [1], added with white Gaussian noise, pink noise, and car noise at SNRs ranging from 0 dB to 20 dB. The white Gaussian noise is computer generated, and the pink noise is generated by filtering white noise with a filter having a 3 dB per octave spectral power descent. The car noise is recorded inside a car driving at constant speed; its spectrum is more low pass than the pink noise. The quality measures used include the SNR, the segmental SNR, and the Log-Spectral Distortion (LSD). The SNR is defined as the ratio of the total signal power to the total noise power in the sentence. The segmental SNR (segSNR) is defined as the average ratio of signal power to noise power per frame. To prevent the segSNR measure from being dominated by a few extremely low values, since the segSNR is measured in dB, it is common practice to apply a lower power threshold e to the signals: any frame with an average power lower than e is not used in the calculation. We set e to 40 dB lower than the average power of the utterance. The segSNR is commonly considered to be more correlated with perceived quality than the SNR measure. The LSD is defined as in [27], where e is the average power of the utterance.
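The LSD equation itself is not legible in the source; a standard log-spectral distortion of the following form, with e acting as a floor on the spectral powers, is consistent with the surrounding text (the exact variant used in [27] may differ):

\mathrm{LSD} = \frac{1}{K}\sum_{k=1}^{K}\left[\frac{1}{N}\sum_{m=1}^{N}\left(10\log_{10}\frac{\max\{|\theta(m,k)|^{2},\,e\}}{\max\{|\hat{\theta}(m,k)|^{2},\,e\}}\right)^{2}\right]^{1/2}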

TFE-MMSE1 is the complete algorithm, and TFE-MMSE2 is the version with simplified MPLPC and reduced covariance matrix (l = 8). It is observed that TFE-MMSE2, although a simplification of TFE-MMSE1, has better performance than TFE-MMSE1. This can be explained as follows: 1) its wider pulse shape is more robust to estimation errors of the impulse positions, and 2) the wider pulse shape can model to some extent the power concentration around the impulse peaks, which is overlooked by the spiky impulses. For this reason, in the following evaluations we investigate only the simplified algorithm.

Informal listening tests reveal that, although the speech enhanced by the TFE-MMSE algorithm has a significantly clearer sound (less muffled than with the reference algorithms), the remaining background noise has musical tones. A solution to the musical noise problem is to set a higher value of the smoothing factor a. Using a larger a sacrifices the SNR and LSD slightly at high input SNRs, but improves the SNR and LSD at low input SNRs, and generally improves the segSNR significantly. The musical tones are also well suppressed. By setting a = 0.999, the residual noise is greatly reduced, while the speech still sounds less muffled than for the reference methods. The reference methods cannot use a smoothing factor as high as the TFE-MMSE: experiments show that at a = 0.999 the MMSE-LSA and the WF result in extremely muffled sounds. The TDC also suffers from a musical residual noise. To suppress its residual noise level to as low as that of the TFE-MMSE with a = 0.999, the TDC requires a μ larger than 8. This causes a sharp degradation of the SNR and LSD, and results in very muffled sounds. The TFE-MMSE2 estimator with a large smoothing factor (a = 0.999) is hereafter termed TFE-MMSE3, and its objective measures are also shown in the figures. To verify the perceived quality of the TFE-MMSE3 subjectively, preference tests between the TFE-MMSE3 and the WF, and between the TFE-MMSE3 and the MMSE-LSA, are conducted. The WF and the MMSE-LSA use their best value of the smoothing factor (a = 0.98). The test is confined to white Gaussian noise and a limited range of SNRs. Three sentences by male speakers and three by female speakers at each SNR level are used in the test. Eight inexperienced listeners are asked to vote for their preferred method based on the amount of noise reduction and speech distortion. The utterances are presented to the listeners through a high quality headphone. The clean utterance is first played as a reference, and the enhanced utterances are played once, or more if the listener finds this necessary. The results in Tables 1 and 2 show that: 1) at 10 dB and 15 dB the listeners clearly prefer the TFE-MMSE over the two reference methods, while at 5 dB the preference for the TFE-MMSE is unclear; 2) the TFE-MMSE method has a more significant impact on the processing of male speech than on the processing of female speech. At 10 dB and above, the speech enhanced by TFE-MMSE3 has barely audible background noise, and the speech sounds less muffled than with the reference methods. There is one artifact heard on rare occasions that we believe is caused by remaining musical tones. It is of very low power and occurs sometimes during speech presence. The two reference methods have higher residual background noise and suffer from muffling and reverberance effects. When the SNR is lower than 10 dB, a certain speech dependent noise occurs during speech presence in the TFE-MMSE3 processed speech. The lower the SNR, the more audible this artifact. Comparing the male and female speech processed by the TFE-MMSE3, the female speech sounds a bit rough.

The algorithms are also evaluated for pink noise and car noise cases.

In these results the TDC algorithm is not included because that algorithm is proposed based on the white Gaussian noise assumption. Informal listening tests show that the perceptual quality in the pink noise case for all three algorithms is very similar to the white noise case, and that in the car noise case all tested methods have very similar perceptual quality.

TAB. 1 - Preference test between WF and TFE-MMSE3 with additive white Gaussian noise.

TAB. 2 - Preference test between MMSE-LSA and TFE-MMSE3 with additive white Gaussian noise.

Discussion

The results show that for male speech the TFE-MMSE3 estimator has the best performance in all three objective measures (SNR, segSNR, and LSD). For female speech, the TFE-MMSE3 is second in SNR, best in LSD, and among the best in segSNR. The TFE-MMSE3 estimator allows a high degree of noise suppression while maintaining low distortion of the signal. The speech enhanced by the TFE-MMSE3 has a very clean background and a certain speech dependent residual noise. When the SNR is high (10 dB and above), this speech dependent noise is very well masked by the speech, and the resulting speech sounds clean and clear. As spectrograms indicate, the clearer sound is due to a better preserved signal spectrum and a more suppressed background noise. At SNRs lower than 5 dB, although the background still sounds clean, the speech dependent noise becomes audible and is perceived as a distortion of the speech. The listeners' preference starts shifting from the TFE-MMSE3 towards the MMSE-LSA, which has a more uniform residual noise, although the noise level is higher. The conclusion here is that at high SNR, it is preferable to remove background noise completely using the TFE-MMSE estimator without major distortion of the speech. This could be especially helpful in relieving listening fatigue for the hearing aid user. At low SNR, on the other hand, it is preferable to use a noise reduction strategy that produces uniform background noise, such as the MMSE-LSA algorithm.

The fact that female speech enhanced by the TFE-MMSE estimator sounds a little rougher than the male speech is consistent with the observation in [24], where male voiced speech and female voiced speech are found to have different masking properties in the auditory system. For male speech, the auditory system is sensitive to high frequency noise in the valleys between the pitch pulse peaks in the time domain. For female speech, the auditory system is sensitive to low frequency noise in the valleys between the harmonics in the spectral domain. While the time domain valleys for the male speech are cleaned by the TFE-MMSE estimator, the spectral valleys for the female speech are not attenuated enough; a comb filter could help to remove the roughness in the female voiced speech.

In the TFE-MMSE estimator, we apply a high temporal resolution non-stationary model to explain the pitch impulses in the LPC residual of voiced speech. This enables the capture of abrupt changes in sample amplitude that are not captured by an AR linear stochastic model. In fact, the estimate of the residual power envelope contains information about the uneven distribution of signal power in time. It can be observed that, by a better model of the temporal power distribution, the TFE-MMSE estimator represents the sudden rises of amplitude better than the Wiener filter.

S Noise in the phase spectrum is reduced by the TFE-MMSE estimator. Although human ears are less sensitive to phase than, to power, it is found in recent work [28] [23]

[25] that phase noise is audible when the. source SNR is very low. In [28] a threshold of phase perception is found. This phase-noise tolerance threshold corresponds to an

SNR threshold of about 6 dB, which means for spectral components with local SNR

|0 smaller than 6 dB, it is necessary to reduce phase noise. The TFE-MMSE estimator has the ability of enhancing phase spectra because of its ability to estimate the temporal localization of residual power. It is the linearity in the phase of harmonics in the residual that makes the power be concentrated at periodic time instances, thus producing pitch pulses. Estimating the residual power temporal envelope enhances the linearity of the

Ig phase spectrum of the residual and therefore reduces phase noise in the signal.

References

[1] DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus. CD-ROM, 1990.

[2] K. Arehart, J. Hansen, S. Gallant, and L. Kalstein. Evaluation of an auditory masked threshold noise suppression algorithm in normal-hearing and hearing impaired listeners. Speech Communication, 40, no. 4:575-592, September 2003.

[3] B. Atal. Predictive Coding of Speech at Low Bit Rates. IEEE Trans. on Comm., pages 600-614, April 1982.

[4] B. Atal and J. Remde. A new model of LPC excitation for producing natural sounding speech at low bit rates. Proc. of ICASSP 1982, 7:614-617, May 1982.

[5] B. S. Atal and M. R. Schroeder. Adaptive predictive coding of speech signals. Bell Syst. Techn. J., 49:1973-1986, 1970.

[6] S. F. Boll. Suppression of Acoustic Noise in Speech Using Spectral Subtraction. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, no. 2:113-120, April 1979.

[7] O. Cappe. Elimination of the Musical Noise Phenomenon with the Ephraim and Malah Noise Suppressor. IEEE Trans. Speech and Audio Processing, 2:345-349, April 1994.

[8] W. B. Davenport and W. L. Root. An Introduction to the Theory of Random Signals and Noise. New York: McGraw-Hill, 1958.

[9] M. Dendrinos, S. Bakamidis, and G. Carayannis. Speech Enhancement from Noise: A Regenerative Approach. Speech Communication, 10:45-57, February 1991.

[10] Y. Ephraim and D. Malah. Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator. IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-32:1109-1121, December 1984.

[11] Y. Ephraim and D. Malah. Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator. IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-33:443-445, April 1985.

[12] Y. Ephraim and H. L. Van Trees. A Signal Subspace Approach for Speech Enhancement. IEEE Trans. Speech and Audio Processing, 3:251-266, July 1995.

[13] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.

[14] R. M. Gray. Toeplitz and circulant matrices: A review, 2002.

[15] J. H. L. Hansen and M. A. Clements. Constrained Iterative Speech Enhancement with Application to Automatic Speech Recognition. IEEE Trans. Signal Processing, 39, no. 4:795-805, April 1991.

[16] S. M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall PTR, 1993.

[17] A. M. Kondoz. Digital Speech: Coding for Low Bit Rate Communication Systems. John Wiley & Sons, 1999.

[18] C. Li and S. V. Andersen. Inter-frequency Dependency in MMSE Speech Enhancement. Proceedings of the 6th Nordic Signal Processing Symposium, June 2004.

[19] J. S. Lim and A. V. Oppenheim. All-pole Modeling of Degraded Speech. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-26:197-209, June 1978.

[20] J. S. Lim and A. V. Oppenheim. Enhancement and Bandwidth Compression of Noisy Speech. Proceedings of the IEEE, 67:1586-1604, December 1979.

[21] R. Martin. Speech Enhancement Using MMSE Short Time Spectral Estimation With Gamma Distributed Speech Priors. Proc. of ICASSP 2002, 1:253-256, May 2002.

[22] N. Moreau and P. Dymarski. Selection of excitation vectors for the CELP coders. IEEE Trans. on Speech and Audio Processing, 2(1):29-41, January 1994.

[23] H. Pobloth and W. B. Kleijn. On Phase Perception in Speech. Proc. of ICASSP 1999, 1:29-32, March 1999.

[24] J. Skoglund and W. Bastiaan Kleijn. On Time-Frequency Masking in Voiced Speech. IEEE Trans. Speech and Audio Processing, 8, no. 4:361-369, July 2000.

[25] J. Skoglund, W. Bastiaan Kleijn, and P. Hedelin. Audibility of Pitch-Synchronously Modulated Noise. Proc. IEEE Workshop on Speech Coding for Telecommunications, pages 51-52, September 1997.

[26] D. Tsoukalas, J. Mourjopoulos, and G. Kokkinakis. Speech enhancement based on audible noise suppression. IEEE Trans. on Speech and Audio Processing, 5(6):497-514, November 1997.

[27] J-M. Valin, J. Rouat, and F. Michaud. Microphone array post-filter for separation of simultaneous non-stationary sources. Proc. of ICASSP 2004, pages I-221, 2004.

[28] P. Vary. Noise Suppression by Spectral Magnitude Estimation - Mechanism and Theoretical Limits. Signal Processing, 8:387-400, May 1985.

[29] N. Virag. Single channel speech enhancement based on masking properties of the human auditory system. IEEE Trans. on Speech and Audio Processing, 7, no. 2:126-137, 1999.


Fig. 2 illustrates a block diagram of a preferred device embodiment. The illustrated device may be such as a mobile phone, a headset or a part thereof. The device is adapted to receive a noisy signal, e.g. an electrical analog or digital signal representing an audio signal containing speech and unintended noise. The device includes a digital signal processor DSP that performs a signal processing on the noisy signal. First, the signal estimation method is performed, preferably including a block based linear MMSE method including employing a statistical model of correlation between spectral components such as described in the foregoing. The signal estimation method serves as input to a noise reduction method as will also be understood from the foregoing. The output of the noise reduction method is a signal where the speech is enhanced in relation to the noise. This signal with enhanced speech is applied to a loudspeaker, preferably via an amplifier, so as to present an acoustic representation of the speech enhanced signal to a listener.

As mentioned, the device in Fig. 2 may be a hearing aid, a headset, a mobile phone or the like. In the case of a headset, the DSP may either be built into the headset, or the DSP may be positioned remote from the headset, e.g. built into other equipment such as amplifier equipment. In the case of a hearing aid, the noisy signal can originate from a remote audio source or from a microphone built into the hearing aid.

Even though the described embodiments are concerned with audio signals, it is appreciated that principles of the methods described can be used for a large variety of applications for audio signals as well as other types of noisy signals.

It is to be understood that reference signs in the claims should not be construed as limiting with respect to the scope of the claims.