Title:
TRANSMISSION OF SPEECH CODING PARAMETERS WITH ECHO CANCELLATION
Document Type and Number:
WIPO Patent Application WO/2005/031706
Kind Code:
A1
Abstract:
Method for transmitting speech parameters representing an acoustic signal encoded by an analysis-through-synthesis method, wherein at least a part of the speech parameters are selected from fixed gains for describing the acoustic signal in a time interval, comprising the steps: estimating the energy (E_l(k)) of a received signal (x(n)) by using the fixed gain (g_l,f(k)) of the received signal (x(n)); estimating the energy of a signal to be sent (y(n)) by using the fixed gain (g_m,f(k)) of the signal to be sent; deriving a loudspeaker modification factor (a_l(k)) for modifying the fixed gain (g_l,f(k)) of the received signal, said loudspeaker modification factor being based on the energy of the received signal (x(n)) and the energy of the signal to be sent (y(n)); modifying the fixed gain (g_l,f(k)) of the received signal with the loudspeaker modification factor.

Inventors:
BEAUGEANT CHRISTOPHE (DE)
DUETSCH NICOLAS (DE)
HEISS HERBERT (DE)
TADDEI HERVE (DE)
Application Number:
PCT/EP2004/051748
Publication Date:
April 07, 2005
Filing Date:
August 09, 2004
Assignee:
SIEMENS AG (DE)
BEAUGEANT CHRISTOPHE (DE)
DUETSCH NICOLAS (DE)
HEISS HERBERT (DE)
TADDEI HERVE (DE)
International Classes:
G10L19/04; G10L21/02; H04M9/08; G10L21/0208; (IPC1-7): G10L19/04; G10L21/02; H04B3/20; H04M9/08
Domestic Patent References:
WO2002054744A1 (2002-07-11)
Foreign References:
US4609788A (1986-09-02)
FR2748184A1 (1997-10-31)
EP1301018A1 (2003-04-09)
US5668794A (1997-09-16)
US5353348A (1994-10-04)
US6011846A (2000-01-04)
Other References:
CHANDRAN R ET AL: "Compressed domain noise reduction and echo suppression for network speech enhancement", CIRCUITS AND SYSTEMS, 2000. PROCEEDINGS OF THE 43RD IEEE MIDWEST SYMPOSIUM, vol. 1, 8 August 2000 (2000-08-08) - 11 August 2000 (2000-08-11), LANSING, MI, USA, pages 10 - 13, XP010558066
Attorney, Agent or Firm:
SIEMENS AKTIENGESELLSCHAFT (München, DE)
Claims:
1. Method for transmitting speech parameters representing an acoustic signal being encoded by an analysis-through-synthesis method, wherein at least a part of the speech parameters are selected from fixed codebook entries and respective fixed gains for describing the acoustic signal in a time interval, comprising the steps: a) estimating the energy (E_l(k)) of a received signal (x(n)) by using the fixed gain (g_l,f(k)) of the received signal (x(n)); b) estimating the energy of a signal to be sent (y(n)) by using the fixed gain (g_m,f(k)) of the signal to be sent; c) deriving a loudspeaker modification factor (a_l(k)) for modifying the fixed gain (g_l,f(k)) of the received signal, said loudspeaker modification factor being based on the energy of the received signal (x(n)) and the energy of the signal to be sent (y(n)); d) modifying the fixed gain (g_l,f(k)) of the received signal with the loudspeaker modification factor.
2. Method according to claim 1, wherein in step c) further a microphone modification factor (a_m(k)) for modifying the signal to be sent (y(n)) is derived, said microphone modification factor being based on the energy of the signal to be sent (y(n)) and the received signal (x(n)), and wherein the fixed gain (g_m,f(k)) of the signal to be sent is modified with the microphone modification factor.
3. Method according to claim 1 or 2, wherein further an adaptive codebook and a respective adaptive gain are used to describe the acoustic signal, and the estimation of the energy of the signal to be sent (y(n)) and the received signal (x(n)) is performed by including the adaptive codebook and the adaptive gain factor.
4. Method according to any of the previous claims, wherein the energy estimates of the signal to be sent and the received signal are averaged and the averaged energies are used for deriving the loudspeaker modification factor and the microphone modification factor.
5. Method according to any of the previous claims, wherein the estimation of the energy of the received signal is performed using the expression

E_l(k) = E_f,c · g_l,f(k) + g_l,a(k) · E_l(k-1)

where E_l(k) is the energy of the received signal in the time interval k, wherein k is an integer number, E_f,c is an entry of the fixed codebook, g_l,f is the fixed gain of the received signal and E_l(k-1) is the energy of the received signal in the previous time interval, and the energy of the signal to be sent is estimated by using the expression

E_m(k) = E_f,c · g_m,f(k) + g_m,a(k) · E_m(k-1)

where E_m(k) is the energy of the signal to be sent in the time interval k, wherein k is an integer number, E_f,c is an entry of the fixed codebook, g_m,f is the fixed gain of the signal to be sent and E_m(k-1) is the energy of the signal to be sent in the previous time interval.
6. Method according to any of the previous claims, wherein the signal to be sent is recorded by a microphone and the received signal is reproduced by a loudspeaker.
7. Method according to any of the previous claims, wherein the time interval is a frame or a subframe.
8. Method according to any of the previous claims 4 to 7, wherein the averaging is performed using the following equation:

Ê_l(k) = (1 - a_r) · E_l(k) + a_r · Ê_l(k-1)   if E_l(k) > Ê_l(k-1)
Ê_l(k) = (1 - a_f) · E_l(k) + a_f · Ê_l(k-1)   else

wherein Ê_l(k) is the estimated (averaged) energy of the received signal, a_r is a weighting factor applied to the estimate if the energy E_l(k) is rising, a_f is the weighting factor applied otherwise, and E_l is the energy determined for a time interval according to claim 5, and further using the equation:

Ê_m(k) = (1 - a_r) · E_m(k) + a_r · Ê_m(k-1)   if E_m(k) > Ê_m(k-1)
Ê_m(k) = (1 - a_f) · E_m(k) + a_f · Ê_m(k-1)   else

wherein Ê_m(k) is the estimated (averaged) energy of the signal to be sent and E_m is the energy determined for a time interval according to claim 5.
9. Method according to any of the previous claims, wherein the modification factor a_l for the received signal is determined by

a_l(k) = 0                          if E_diff(k) < -m/2
a_l(k) = (1/m) · E_diff(k) + 0.5    if -m/2 ≤ E_diff(k) ≤ m/2
a_l(k) = 1                          if m/2 < E_diff(k)

wherein m is an integer number denoting an energy threshold, E_diff is the difference in energy between the received signal and the signal to be sent and k is an integer number denoting the time interval.
10. Method wherein the difference in energy is determined by the ratio E_diff(k) = 10 · log( Ê_l(k) / Ê_m(k) ) of the energies estimated according to claim 8.
11. Method wherein the microphone modification factor a_m is determined by a_m(k) = 1 - a_l(k), wherein a_l is the loudspeaker modification factor.
12. Speech coding apparatus with a processing unit set up for performing a method according to any of the claims 1 to 11.
13. Communications device, particularly a mobile phone, with a speech coding apparatus according to claim 12.
14. Communications network with a speech coding apparatus according to claim 12.
Description:
TRANSMISSION OF SPEECH CODING PARAMETERS WITH ECHO CANCELLATION

The invention refers to a transmission method in an analysis-by-synthesis system.

Speech signals are often encoded using linear prediction.

This is e.g. realized by using CELP (Code Excited Linear Prediction) encoders, where a synthetically produced speech signal is approximated with respect to the original speech signal. The synthetic signal is described by so-called codebooks, which contain a variety of excitation signals, and by respective gains for amplifying or attenuating the excitation signals.

The excitation signals are fed into a filter modelling the vocal tract of a human, and the result of the filtering is compared with the original speech signal. The excitation signals and gains are chosen such that the synthetic signal approximates the original signal.

As codebooks, mostly a so-called adaptive codebook for describing periodic components of speech and a fixed codebook are used.

This will be described in greater detail with reference to Fig. 2, using an example from the AMR (Adaptive Multi-Rate) codec, which can be used for GSM (Global System for Mobile Communications) applications and which is mandatory for UMTS (Universal Mobile Telecommunications System).

First a model of a communication system derived from an end-to-end conversation of at least two communication devices will be described with reference to Fig. 1. The signal or the voice of one speaker, the near-end speaker B, is recorded by a microphone M and digitized for encoding the signal in an encoder as a preparation for transmission. The encoding may comprise e.g. source and channel coding.

This coded signal is transmitted over a noisy channel, i.e. the signal experiences degradation by the noise, to the counterpart of the conversation, where it is decoded in a decoder and converted into an analog signal that can be played over a loudspeaker L. The system can be considered symmetric, thus only one side needs to be regarded.

The other party of the conversation, the far-end speaker A, is linked with speaker B through the so-called loudspeaker path, whereas the near-end speaker B is linked with the far-end speaker A through the microphone path.

Since speech signals are analog signals, they have to be sampled and converted back for coding reasons. In the framework of this application the index "t", which denotes a time instant explicitly, refers to analog signals, whereas digital signals are denoted by index "n" or "k", which denote a certain time interval, e.g. a frame or a subframe. "k" references source-coded signals, while the digital uncompressed representation of speech is indexed with "n".

The voice of the far-end speaker A is transmitted to the loudspeaker at speaker B and converted to sound waves, so that the near-end speaker is able to understand the other talker. In the same way, the voice of the near-end speaker is converted to a digital signal by the microphone and the A/D converter. Besides the near-end speaker's voice, environmental noise and reflected sound waves of the loudspeaker output are recorded by the microphone. Therefore the microphone signal y(t) is the sum of the near-end speech s(t), of the noise n(t) and of the reflected signal or (acoustic) echo signal e(t):

y(t) = s(t) + e(t) + n(t)   (1)

This describes the so-called double-talk situation, in contrast to the single-talk situation. In the following the noise n(t) is assumed to be 0.

The single-talk mode is characterized by only one person talking. If only the near-end speaker B is talking, the microphone signal is

y(t) = s(t), e(t) = 0   (2)

and

y(t) = e(t), s(t) = 0   (3)

if only the far-end speaker is talking.

In double-talk both persons are talking simultaneously, so the microphone signal is the sum of the near-end speaker's voice and the echo of the far-end speaker's voice.

The main problem of acoustic echo e(t) is that the speaker's own voice is transmitted back to him. This does not affect the quality of the conversation as long as twice the transmission time (transmission over loudspeaker and microphone path) is less than or equal to the reflection time of the speaker's own voice in his LRM (loudspeaker-room-microphone) system. Then the echo is masked. Unfortunately, in mobile telephony or satellite telephony the transmission time is usually much longer. The result is that the far-end speaker A hears his own voice with a delay of about 160 ms in the case of fixed-to-mobile network transmission. This greatly decreases the quality of the conversation, especially when the partner of the conversation is not talking.

In order to reduce the echo in the microphone signal y(t), a so-called "gain-loss control" has been introduced, where the microphone signal y(t) is attenuated while the far-end speaker A is talking. Thus less noise or echo is transmitted from B to A while A is speaking.

In other words, the principle of this method is to attenuate the microphone signal if the far-end speaker is talking and to reduce the loudspeaker signal in power if the near-end speaker is talking. The effect is that only a strongly reduced echo signal is transmitted to the far-end speaker as either the loudspeaker or the microphone path is attenuated.

The disadvantage of the gain-loss control, however, is its behavior in double-talk mode. Because in this mode the loudspeaker and the microphone signals do not differ much in their energy, the gain-loss control attenuates both paths.

Thus conversation is only possible during single-talk mode without decreasing the intelligibility of the conversation.

The sound waves, which are excited by the diaphragm of the loudspeaker, spread in the LRM system with sonic speed. As they are reflected by objects and walls in this LRM system, there is a superposition of waves at the microphone. These sound waves are delayed and attenuated variably because of the different propagation paths. So the echo signal e(t) is a sum of attenuated and delayed signals x(t). This is modeled by the convolution of the loudspeaker signal x(t) with the impulse response of the LRM system h(τ, t):

e(t) = x(t) * h(τ, t)   (4)

where h(τ, t) is the channel response at observation time t to an impulse at t - τ. The impulse response is time-varying, depending on the motion of people, objects, loudspeaker or microphone in the LRM system.
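A discrete-time version of eq. (4) can be sketched in a few lines. The following Python snippet is purely illustrative: the impulse response values, delays and sampling rate are assumptions for the example, not taken from the patent.

```python
import numpy as np

def simulate_echo(x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Discrete-time counterpart of eq. (4): the echo is the
    loudspeaker signal convolved with the LRM impulse response."""
    return np.convolve(x, h)[:len(x)]

fs = 8000                               # 8 kHz sampling rate, as used by AMR
h = np.zeros(fs // 4)                   # 250 ms of (hypothetical) reflections
h[[400, 800, 1600]] = [0.6, 0.3, 0.1]   # delays of 50, 100 and 200 ms

x = np.random.randn(fs)                 # 1 s stand-in for loudspeaker speech
e = simulate_echo(x, h)                 # echo component of the microphone signal
```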

The proper attenuation of the microphone signal or the loudspeaker signal depending on the energy of the echo signal is, however, complex to calculate.

Based on the foregoing description, it is therefore an object of the invention to provide the possibility of implementing a gain-loss control that works reliably and does not require high complexity.

This object is solved by the subject matter disclosed in the independent claims. Advantageous embodiments of the present invention will be presented in the dependent claims.

In a method for transmitting speech data or parameters, said speech data are encoded by using an analysis-through-synthesis method. For the analysis through synthesis, a synthesised signal is produced for approximating the original signal. The production of the synthesised signal is performed by using at least a fixed codebook with a respective fixed gain and optionally an adaptive codebook and an adaptive gain.

The entries of the codebook and the gain are chosen such that the synthesised signal resembles the original signal.

Parameters describing these quantities will be transmitted from a sender to a receiver, e. g. from a near-end speaker to a far-end speaker or vice versa. These parameters are part of the speech data.

The invention is based on the idea of introducing a gain-loss control implemented in the parameter domain. Therefore the energy of a received signal, which may be reproduced by a loudspeaker, is estimated on the basis of its fixed gain. As the parameters used for the analysis-through-synthesis method are transmitted from the far-end speaker, the fixed gain is available at the near-end speaker.

Furthermore the energy of a signal to be sent, e.g. of the near-end speaker, which may be recorded by using a microphone, is also estimated on the basis of its fixed gain. As this signal needs to be encoded before transmission to the far-end speaker, this fixed gain is also available at the near-end speaker.

From the fixed gain of the received signal and the fixed gain of the signal to be sent, which both represent the energies of the respective signals, modification factors are derived: one for the fixed gain of the signal to be sent and one for the fixed gain of the received signal.

In particular, depending on the respective energies of the signal to be sent and the received signal, a further "shifting" of the energies to one of the signals can be performed, thus achieving a clear situation in which one signal is predominant, which enables a better understanding of the speech associated with the predominant signal. The weaker signal, which most probably resembles echo, can be attenuated.

The advantage of this gain-loss control method is that the respective shifting is far less complex to calculate compared to a gain-loss control method in the time domain.

In a preferred embodiment the energies of the received signal and/or the signal to be sent are averaged before the modification factors are derived. The advantage thereof is to take account of short-time variations that may distort the actual speaking situation, e.g. a very loud, short noise, which superimposes the echo, may falsely suggest that not the actual speaker but the other party of the communication is active.

It is advantageous if the sum of both modification factors is equal to unity in order not to distort the total energy of the speech.

An encoding apparatus set up for performing the above-described encoding method includes at least a processing unit. The encoding apparatus may be part of a communications device, e.g. a cellular phone, or it may also be situated in a communication network.

In the following the invention will be described by means of preferred embodiments with reference to the accompanying drawings, in which:

Fig. 1 depicts a schematic model of a communication between a far-end and a near-end speaker;
Fig. 2 shows schematically the function of the AMR encoder;
Fig. 3 shows schematically the function of the AMR decoder;
Fig. 4 depicts a characteristic of the modification factors versus the energy difference between the signal to be sent and the received signal;
Fig. 5 shows the non-linearity of a codec by the example of the AMR codec;
Fig. 6 depicts a schematic model of a communication between a far-end and a near-end speaker with the various gains.

Function of an encoder (Fig. 2)

First the function of a speech codec is described by a special implementation of a CELP-based codec, the AMR (Adaptive Multi-Rate) codec. The codec consists of a multi-rate speech codec (the AMR codec can switch between the following bit rates: 12.2, 10.2, 7.95, 7.40, 6.70, 5.90, 5.15 and 4.75 kbit/s), a source-controlled rate scheme including a Voice Activity Detection (VAD), a comfort noise generation system and an error concealment mechanism to compensate for the effects of transmission errors.

Fig. 2 shows the scheme of the AMR encoder. It uses an LTP (long-term prediction) filter, which is transformed to an equivalent structure called the adaptive codebook. This codebook saves former LPC-filtered excitation signals. Instead of subtracting a long-term prediction as the LTP filter does, an adaptive codebook search is done to get an excitation vector from former LPC-filtered speech samples. The amplitude of this excitation is adjusted by a gain factor g.

The encoding of the speech is now described with reference to the numbers given in Fig. 2:

1. The speech signal is processed block-wise and thus partitioned into frames and sub-frames. Each frame is 20 ms long (160 samples at 8 kHz sampling frequency) and is divided into 4 sub-frames of equal length.

2. LPC analysis of a Hamming-windowed frame.

3. Because of stability reasons, the LPC filter coefficients are transformed to Line Spectrum Frequencies (LSF).

Afterwards these coefficients are quantized in order to save bit rate. This step and the previous one are done once per frame (except in 12.2 kbit/s mode, where the LPC coefficients are calculated and quantised twice per frame), whereas steps 4 to 9 are performed on a sub-frame basis.

4. The sub-frames are filtered by an LPC filter with re-transformed and quantised LSF coefficients. Additionally the filter is modified to improve the subjective listening quality.

5. As the encoding is processed block by block, the decaying part of the filter response, which is longer than the block length, has to be considered when processing the next sub-frame. In order to speed up the minimization of the residual power described in the following, the zero-input response of the synthesis filter excited by previous sub-frames is subtracted.

6. The power of the LPC-filtered error signal e(n) depends on four variables: the excitation of the adaptive codebook, the excitation of the fixed codebook and the respective gain factors g_a and g_f. In order to find the global minimum of the power of the residual signal, and as no closed-form solution of this problem exists, all possible combinations of these four parameters would have to be tested exhaustively. As this minimization is too complex, the problem is divided into subproblems, which of course results in a suboptimal solution. First the adaptive codebook is searched to get the optimal lag L and gain factor g_a,L. Afterwards the optimal excitation scaled with the optimal gain factor is synthesis-filtered and subtracted from the target signal. This adaptive codebook search corresponds to LTP filtering as shown.

7. In a second step of the minimization problem the fixed codebook is searched. The search is equivalent to the previous adaptive codebook search, i.e. the codebook vector that minimizes the error criterion is sought. Afterwards the optimal fixed gain is determined. The resulting coding parameters are the index of the fixed codebook vector J and the optimal gain factor g_f,J.

8. The scaling factors of the codebooks are quantized jointly (except in 12.2 kbit/s mode, where both gains are scalar-quantized separately), resulting in a quantization index, which is also transmitted to the decoder.

9. Completing the processing of the sub-frame, the optimal excitation signal is computed and saved in the adaptive codebook. The synthesis filter states are also saved so that this decaying part can be subtracted in the next sub-frame.
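To make the sequential search of steps 6 and 7 concrete, here is a minimal Python sketch. It is a strong simplification under stated assumptions: the codebooks are plain matrices of candidate vectors, and the synthesis filtering and perceptual weighting of the real AMR search are omitted, so this illustrates only the two-stage minimization, not the actual codec.

```python
import numpy as np

def best_match(target: np.ndarray, codebook: np.ndarray):
    """Find the codeword and gain minimizing ||target - g*c||^2.
    The optimal gain per codeword is <target, c> / <c, c>."""
    norms = np.einsum('ij,ij->i', codebook, codebook) + 1e-12
    gains = codebook @ target / norms
    errors = ((target[None, :] - gains[:, None] * codebook) ** 2).sum(axis=1)
    idx = int(np.argmin(errors))
    return idx, gains[idx]

def sequential_codebook_search(target, adaptive_cb, fixed_cb):
    """Step 6: adaptive codebook search; step 7: fixed codebook search
    on what remains of the target after the adaptive contribution."""
    lag, g_a = best_match(target, adaptive_cb)
    residual = target - g_a * adaptive_cb[lag]
    j, g_f = best_match(residual, fixed_cb)
    return (lag, g_a), (j, g_f)
```

The split into two consecutive searches is exactly the suboptimality mentioned in step 6: the jointly optimal pair of codewords may differ from the sequentially found one.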

Function of a decoder (Fig. 3)

Now the decoder is described with reference to Fig. 3. As shown in the previous section, the encoder transforms the speech signal to parameters which describe the speech. We will refer to these parameters, namely the LSF (or LPC) coefficients, the lag of the adaptive codebook, the index of the fixed codebook and the codebook gains, as "speech coding parameters". The domain will be called "(speech) codec parameter domain" and the signals of this domain are subscripted with frame index k.

Fig. 3 shows the signal flow of the decoder. The decoder receives the speech coding parameters and computes the excitation signal of the synthesis filter. This excitation signal is the sum of the excitations of the fixed and adaptive codebook scaled with their respective gain factors.

After the synthesis-filtering is performed, the speech signal is post-processed.
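A minimal sketch of the decoder's excitation computation and synthesis filtering, assuming LPC coefficients in the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p; the post-processing mentioned above is omitted:

```python
import numpy as np
from scipy.signal import lfilter

def decode_subframe(adaptive_vec, fixed_vec, g_a, g_f, lpc_coeffs):
    """Excitation = scaled adaptive + scaled fixed codebook vector,
    then synthesis filtering with 1/A(z)."""
    excitation = g_a * np.asarray(adaptive_vec) + g_f * np.asarray(fixed_vec)
    return lfilter([1.0], np.concatenate(([1.0], lpc_coeffs)), excitation)
```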

Speech Modes in the Parameter Domain

Now the implementation of the gain-loss control is described in detail.

The two different conversation modes, single-talk and double- talk, can be transferred to the codec parameter domain.

For single-talk, eq. (2) can thus be written as

g_y(k) = g_s(k) for e(t) = 0   (5)

and eq. (3) as

g_y(k) = g_e(k) for s(t) = 0   (6)

g_y(k) is the gain factor of the fixed codebook, also called the fixed gain. The sub-indices refer to the bit-stream or the respective speech signal the fixed gain is computed from: s denotes speech and e denotes echo.

If the far-end speaker A is not talking, the gain of the fixed codebook of the microphone signal y (n) is equal to the gain of the near-end speech. Furthermore the gain of the microphone signal is equal to the gain of the echo signal, if speaker B is not talking.

In double-talk mode the microphone signal is the sum of the speech and of the echo. After LPC filtering the resulting signal is no longer the sum of the filtered speech and echo.

This is depicted in Fig. 5. In Fig. 5, first the speech s(n) is analyzed in order to get a set of LPC coefficients a_s. The LPC-filtered speech is denoted as s'(n). The same is done with the echo e(n), resulting in e'(n) with coefficients a_e. As long as the echo is not a scalar-weighted copy of the speech signal, i.e. e(n) ≠ α·s(n), which is normally true, the auto-correlation functions are not directly proportional and therefore two different sets of filter coefficients are obtained during the LPC analysis.

If the microphone signal y(n) is analysed and filtered, the signal y'(n) is obtained. Because e(n) ≠ α·s(n), the microphone signal y(n) = s(n) + e(n) is proportional neither to the speech nor to the echo, and thus the respective LPC coefficients a_y are different from those of the speech and the echo signal. As all three signals are filtered with different LPC filters, the modified microphone signal is no longer the sum of the filtered speech and echo: y'(n) ≠ s'(n) + e'(n).

Since the LTP filter is based on a similar analysis equation as the LPC filter, it can further be shown that after LTP filtering (or the adaptive codebook search) the modified microphone signal is not additively linked with the modified speech and echo signals. Therefore different codebook vectors and different gains are obtained by the fixed codebook search.
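The non-additivity argument can be checked numerically. The sketch below computes LPC coefficients with the autocorrelation method (a hypothetical helper, not the AMR routine) for two signals and their sum; the coefficients of the sum differ from both individual sets, as claimed above.

```python
import numpy as np

def lpc(signal: np.ndarray, order: int = 10) -> np.ndarray:
    """LPC coefficients via the autocorrelation method: solve the
    normal equations R a = r for the predictor coefficients."""
    r = np.correlate(signal, signal, mode='full')[len(signal) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

rng = np.random.default_rng(0)
s = rng.standard_normal(160)          # stand-in for one frame of speech
e = 0.5 * rng.standard_normal(160)    # stand-in for the echo

a_s, a_e, a_y = lpc(s), lpc(e), lpc(s + e)
# a_y differs from a_s and a_e, so filtering y(n) = s(n) + e(n) with
# its own LPC filter does not yield s'(n) + e'(n).
```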

Because of the non-linearity of the speech codec shown above, the gain of the microphone signal cannot be written as the sum of the gains of the speech and echo signals in double-talk mode. Thus equation (1) transforms into

g_y(k) = f(g_s(k), g_e(k))   (7)

The function can be restricted if the following is assumed: if neither the far-end speaker A nor the near-end speaker B is talking, there is no microphone input signal (s(t) = 0, e(t) = 0 → y(t) = 0). By encoding a null-file, the gain of the fixed codebook is set equal to zero (x(t) = 0 → g_x(k) = 0).

After LPC and LTP analysis and processing, the residual speech signal is still null. As the vectors of the fixed codebook are non-zero, the corresponding gain has to be set to zero, so that this excitation synthesises a null-file.

Therefore eq. (7) is restrained to

g_y(k) = f(g_s(k) = 0, g_e(k) = 0) = 0   (8)

This means that the "gain function" has no absolute term.

Replacement of codec parameters

Noise has an influence on the above-described codec parameters.

Therefore the AMR decoder was slightly modified such that it could read two bit-streams coming from encoded files with different background noise levels. Then, according to the desired experiment, the decoder can use some of the parameters encoded with a high noise level or some parameters encoded with a low noise level. This makes it possible to evaluate the influence of the codec parameters on the noise level as well as on the quality of the decoded speech.

The results of a Comparison Category Rating (CCR) listening test done in this experiment showed that the level of background noise is more or less represented by the gain of the fixed codebook. Thus modifications of this gain lead to noise reduction. As similar methods are applied in the time and/or frequency domain to noise reduction as well as to echo cancellation, the conclusion of this experiment is transferred to the echo cancellation problem in the parameter domain. Therefore the fixed codebook gain is modified in order to reduce the acoustic echo.

Gain-Loss Control on Speech Codec Parameters (Fig. 6)

With reference to Fig. 6, an echo cancellation method is described that can be implemented directly in the AMR codec or any speech codec based on CELP coding. The method is based on the idea of modifying the fixed codebook gain in the encoder as well as in the decoder. The changing of these parameters is done on a sub-frame basis using attenuation factors, which are determined according to the energy of the signal in the loudspeaker and microphone path, respectively. This energy estimation is performed on the basis of speech codec parameters.

Fig. 6 shows the principle of the gain-loss control.

Furthermore the loudspeaker-room-microphone (LRM) system is depicted. The encoded speech bit stream x(k) is transmitted and decoded to the speech x(n). The speech signal x(n) is reflected in the surrounding room, for example in a car or in an office, and recorded by the microphone device. The reflection is described by convolving the speech x(n) with the impulse response h(n), resulting in the echo e(n). Besides the echo, the microphone signal also contains the usual speech s(n). Thus the microphone signal y(n) is the sum of s(n) and e(n). This signal is encoded, resulting in the bit stream y(k).

The central idea is to attenuate the fixed gain of the microphone signal by a factor a_m(k) and the fixed gain of the loudspeaker path by a_l(k). In order to determine these modification factors, the energy is estimated on the basis of codec parameters. Therefore the fixed gains of the loudspeaker path (g_l,f(k)) and the microphone path (g_m,f(k)) and the respective adaptive gains g_l,a(k) and g_m,a(k) of the loudspeaker and microphone paths are passed to the control unit.

Estimation of the energy

The estimation of the speech signal energy is done on a sub-frame basis, as the gains of the fixed and adaptive codebook are computed every sub-frame. Because the estimation is done on loudspeaker path parameters as well as on microphone path parameters, the indices l and m of the gains are neglected for now. With reference to Fig. 2, different estimations are foreseen:

a) A simple idea is to consider the fixed codebook and to use the energy of the codeword E_f multiplied with the corresponding gain g_f(k):

Ê_1(k) = E_f · g_f(k)

In a speech synthesis model the vectors of the fixed codebook can be seen as excitations comparable to a noise generator of equal power, as these normalized vectors represent the residual whitened part of the LPC- and LTP-filtered speech.

The fixed codebook gain amplifies these vectors so that the power of the synthesized speech is increased.

b) A further step is to include the excitation of the adaptive codebook and its gain g_a(k) in the energy estimation process:

Ê_2(k) = E_f · g_f(k) + Ê_2(k-1) · g_a(k)

As the adaptive codebook is updated with the summed and scaled excitations of the fixed and adaptive codebook vectors of the former sub-frame, a recursive estimation is in fact used.

This approximation has proven especially relevant in practice.

c) The third estimation method makes use of the speech synthesis filter, i.e. the estimated energy Ê_2(k) is multiplied by an amplification factor A(k):

Ê_3(k) = Ê_2(k) · A(k)

Therefore the LPC coefficients of the synthesis filter are used to determine the amplification factor A(k), with

H(f) = 1/A(z), evaluated at z = e^(j2πf)

where A(z)^(-1) is the transfer function of the synthesis filter in the analysis-through-synthesis system. Finally, this estimation rule considers the whole synthesis process, including the excitations of both codebooks, the fixed and the adaptive gain and the speech synthesis filter.
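Estimation methods a) and b) translate directly into a short recursion over sub-frames. In this sketch the codeword energy E_f is treated as a known constant (set to 1 for illustration), and the gain trajectories g_f(k) and g_a(k) are assumed to be available from the bit-stream.

```python
import numpy as np

def estimate_energy(g_f: np.ndarray, g_a: np.ndarray, E_f: float = 1.0):
    """Method b): E(k) = E_f * g_f(k) + g_a(k) * E(k-1).
    Dropping the recursive term gives method a)."""
    E = np.zeros(len(g_f))
    prev = 0.0
    for k in range(len(g_f)):
        E[k] = E_f * g_f[k] + g_a[k] * prev
        prev = E[k]
    return E
```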

The different estimation methods were tested with various speech signals. The energy may be computed on a sub-frame basis; thus all energy graphs refer to the energy per sub-frame. The scaling of the energy is done on a dB scale, converting the real-number representation to a logarithmic scale without normalization. Hence 0 dB refers to a sum of 40 speech samples (5 ms · 8 kHz) with a result of 1.

In order to get a more precise statement concerning the applicability of the different estimation methods a) to c), the auto- and cross-correlation functions of the energy signals are computed. The correlation is only processed during speech periods, so that different floors of the estimation, mainly during speech pauses, are not taken into account. Therefore the VAD (voice activity detector) of the AMR codec is used to detect these speech pauses.

Energy estimation based on codec parameters

First the energy estimation of the signal in the loudspeaker path and in the microphone path is described. As set out above, the respective energies are used for deriving attenuation factors to be applied to the fixed gains.

The energy E_l of the (loudspeaker) speech signal x(k) and the energy E_m of the microphone signal y(k) are estimated based on the respective adaptive and fixed gains:

E_l(k) = E_f · g_l,f(k) + g_l,a(k) · E_l(k-1)
E_m(k) = E_f · g_m,f(k) + g_m,a(k) · E_m(k-1)

where g_l,f (resp. g_m,f) stands for the fixed gain and g_l,a (resp. g_m,a) for the adaptive gain of the signal x(k) (resp. y(k)) of the loudspeaker path and the microphone path, respectively.

As a result, the decoder and the encoder are modified such that it is possible to extract the gains. A correlation analysis of the energy per sub-frame and the estimated energy showed that this method is applicable for properly estimating the energy.

Determination of the attenuation factors

After having estimated the respective energies, the attenuation factors are derived therefrom.

In order, e.g., not to overestimate short-term variations of said energies, long-term estimations of the energy of the loudspeaker path Ê_l and of the microphone path Ê_m are determined by using a non-linear filter described by

Ê_l(k) = (1 - a_r) · E_l(k) + a_r · Ê_l(k-1)   if E_l(k) > Ê_l(k-1)
Ê_l(k) = (1 - a_f) · E_l(k) + a_f · Ê_l(k-1)   else

Ê_m(k) = (1 - a_r) · E_m(k) + a_r · Ê_m(k-1)   if E_m(k) > Ê_m(k-1)
Ê_m(k) = (1 - a_f) · E_m(k) + a_f · Ê_m(k-1)   else

where a_r describes the behaviour of the filter at a rising edge of the speech signal energy and a_f at a falling edge.

With a_r < a_f, an increase in the speech level is emphasised and the filter decays more slowly if the speech signal decreases in energy. Using this strategy, short speech pauses, for example during spoken words, are neglected. In particular, the values a_r = 0.7 and a_f = 0.95 have been found useful in practice.
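The non-linear smoothing filter is a one-pole recursion whose coefficient switches on rising versus falling energy; a minimal sketch using the values a_r = 0.7 and a_f = 0.95 quoted above:

```python
def smooth_energy(E, a_r: float = 0.7, a_f: float = 0.95):
    """Fast attack on rising energy (small a_r), slow decay on falling
    energy (large a_f), so short speech pauses are bridged."""
    E_hat, prev = [], 0.0
    for e in E:
        a = a_r if e > prev else a_f
        prev = (1.0 - a) * e + a * prev
        E_hat.append(prev)
    return E_hat
```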

In another preferred embodiment the low-pass filtered energy of a speech signal is taken to determine an attenuation factor. This factor decreases the amplitude of the microphone signal depending on its energy. The estimate of the energy is thus (low-pass) filtered in order to eliminate fast changes. Therefore small speech pauses are neglected when determining the attenuation factors. For the filtering a non-linear filter is used.

The resulting long-term estimations Ê_l(k) and Ê_m(k) are used to determine the attenuation factors for the loudspeaker and microphone path, respectively.

In the following, a magnitude representative of the energy difference E_diff(k) is determined; e.g. the energy difference E_diff(k) is chosen as the ratio of the long-term loudspeaker and microphone energies in the logarithmic domain:

E_diff(k) = 10 · log( Ê_l(k) / Ê_m(k) )

Using this energy difference, the attenuation factor is derived, which for the loudspeaker gain is e.g. characterised by

a_l(k) = 0                          if E_diff(k) < -m/2
a_l(k) = (1/m) · E_diff(k) + 0.5    if -m/2 ≤ E_diff(k) ≤ m/2
a_l(k) = 1                          if m/2 < E_diff(k)

The corresponding attenuation factor of the microphone path is then calculated by

a_m(k) = 1 - a_l(k)

Thus, either the microphone path or the loudspeaker path is attenuated, and hence it is guaranteed that at least one of both paths is attenuated. If the microphone signal is attenuated completely, then the loudspeaker signal is not modified at all, and vice versa.
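The mapping from the smoothed energies to the attenuation factors (the linear ramp of curve 2 in Fig. 4) can be sketched as follows; the threshold m = 20 dB and the eps guard are illustrative assumptions, not values from the patent:

```python
import numpy as np

def attenuation_factors(E_l_hat, E_m_hat, m: float = 20.0, eps: float = 1e-12):
    """a_l ramps linearly from 0 to 1 as E_diff goes from -m/2 to m/2;
    the microphone factor is the complement a_m = 1 - a_l."""
    E_diff = 10.0 * np.log10((np.asarray(E_l_hat) + eps) /
                             (np.asarray(E_m_hat) + eps))
    a_l = np.clip(E_diff / m + 0.5, 0.0, 1.0)
    return a_l, 1.0 - a_l
```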

In the preferred embodiment, where the low-pass filtered energies of the microphone and loudspeaker signals are used for deriving the attenuation factors a_l(k) and a_m(k), the latter are consequently functions of the low-pass filtered energy estimations.

In principle, the attenuation factors a_l(k) and a_m(k) should lie in the interval [0; 1], so that the fixed gain is either attenuated completely or not at all in the extreme cases.

The acoustic echo can be decreased either before it is generated, by modifying the loudspeaker signal, or after its forming, by attenuating the microphone signal. Therefore, as set out above, the attenuation factor a_m(k) can be chosen according to a_m(k) = 1 - a_l(k). The control characteristics can easily be modified for silence mode, i.e. when both speakers do not talk. If the estimated energies of the loudspeaker signal and of the microphone signal are below a certain threshold, a_m = 0.5 can be chosen, meaning that the loudspeaker path is attenuated as strongly as the microphone path. A small sketch of this control logic follows below.
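Putting the pieces together, a hypothetical control unit per sub-frame could look like the following; the silence threshold is an assumed value, the inputs are numpy arrays of per-sub-frame gains, and the helpers are the sketches given earlier:

```python
def control_unit(g_lf, g_la, g_mf, g_ma, silence_threshold: float = 1e-4):
    """Estimate energies, smooth them, derive the attenuation factors,
    then scale the fixed gains of both paths."""
    E_l = estimate_energy(g_lf, g_la)           # loudspeaker path energy
    E_m = estimate_energy(g_mf, g_ma)           # microphone path energy
    a_l, a_m = attenuation_factors(smooth_energy(E_l), smooth_energy(E_m))
    # Silence mode: both paths equally attenuated.
    for k in range(len(a_l)):
        if E_l[k] < silence_threshold and E_m[k] < silence_threshold:
            a_l[k] = a_m[k] = 0.5
    return g_lf * a_l, g_mf * a_m               # modified fixed gains
```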

The behaviour of the attenuation factors a_l and a_m can be seen in Fig. 4, where these attenuation factors are depicted versus the energy difference expressed in dB. m denotes a threshold of the energy difference: at E_diff(k) = -m = 10·log(Ê_l/Ê_m) the loudspeaker energy is considered negligible, and at E_diff(k) = m the microphone energy is negligible.

Curve 1 shows an abrupt change of the attenuation factors from 0 to 1 and from 1 to 0 when the energies of the loudspeaker path and the microphone path are equal.

Curve 2 shows a linear behaviour, where the respective modification factors increase/decrease starting from -m/2 or m/2.

In curve 3 the linear behaviour of curve 2 is smoothed.

The curves do not need to be symmetric; asymmetric behaviour can also be applied, e.g. for applications where the loudspeaker signal or the microphone signal is emphasised.