A METHOD FOR ESTIMATING SIGNAL CODING PARAMETERS

Document Type and Number:

WIPO Patent Application WO/2008/109904

Abstract:

A method for estimating coding parameters of a predictive filter model of a digital signal, in particular speech signal, comprises: receiving a segment of said signal; computing the spectrum of said segment; estimating the background noise in said segment; and estimating the fundamental frequency in said segment; computing a spectral mask on the basis of said background noise and said fundamental frequency; and determining those coding parameters that substantially minimize a cost function which is based on said spectrum, said spectral mask and said predictive filter model.

Inventors:

WERUAGA LUIS (AT)

Application Number:

PCT/AT2008/000087

Publication Date:

September 18, 2008

Filing Date:

March 12, 2008

Assignee:

OESTERREICHISCHE AKADEMIE DER (AT)
AUSTRIA WIRTSCHAFTSSERVICE GMBH (AT)
WERUAGA LUIS (AT)

International Classes:

G10L25/48; G10L25/12

Foreign References:

EP0851406A2

1998-07-01

Other References:

GU L ET AL: "Perceptual harmonic cepstral coefficients for speech recognition in noisy environment", 2001 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING. PROCEEDINGS. (ICASSP). SALT LAKE CITY, UT, MAY 7 - 11, 2001, IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), NEW YORK, NY : IEEE, US, vol. VOL. 1 OF 6, 7 May 2001 (2001-05-07), pages 125 - 128, XP010803060, ISBN: 0-7803-7041-4
HERMANSKY H ET AL: "Perceptual linear predictive (PLP) analysis-resynthesis technique", EUROSPEECH 91. 2ND EUROPEAN CONFERENCE ON SPEECH COMMUNICATION AND TECHNOLOGY PROCEEDINGS ISTITUTO INT. COMUNICAZIONI GENOVA, ITALY, 1991, pages 329 - 332 vol.1, XP002442693
LUKASIAK J ET AL: "Linear prediction incorporating simultaneous masking", 2000 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING. PROCEEDINGS (CAT. NO.00CH37100) IEEE PISCATAWAY, NJ, USA, vol. 3, 2000, pages 1471 - 1474 vol., XP002442694, ISBN: 0-7803-6293-4
ZHAO Y: "FREQUENCY-DOMAIN MAXIMUM LIKELIHOOD ESTIMATION FOR AUTOMATIC SPEECH RECOGNITION IN ADDITIVE AND CONVOLUTIVE NOISES", IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 8, no. 3, May 2000 (2000-05-01), pages 255 - 266, XP011054019, ISSN: 1063-6676

Attorney, Agent or Firm:

WEISER, Andreas (Wien, AT)

Claims:

1. A method for estimating coding parameters of a predictive filter model of a digital signal, in particular speech signal, comprising: receiving a segment of said signal; computing the spectrum of said segment; estimating the background noise in said segment; and estimating the fundamental frequency in said segment; characterized by the steps of: computing a spectral mask on the basis of said background noise and said fundamental frequency; and determining those coding parameters that substantially minimize a cost function which is based on said spectrum, said spectral mask and said predictive filter model.

2. The method of claim 1, wherein said coding parameters are the gain level and the filter coefficients of said predictive filter model.

3. The method of claim 2, wherein said cost function is

with X(ω) being said spectrum, α(ω) being said spectral mask, η being said gain level, and H(ω) being the transfer function, based on said filter coefficients, of the synthesis equivalent of the predictive filter model.

4. The method of any of the claims 1 to 3, wherein said spectral mask is

$$\alpha(\omega) = p(\omega) \sum_{k} \delta(\omega - k\omega_0)$$

with ω₀ being said fundamental frequency and p(ω) being a noise mask based on said background noise.

5. The method of claim 4, wherein said noise mask is

with X(ω) being said spectrum and N(ω) being the power spectrum of said background noise.

6. The method of any of the claims 1 to 5, wherein said step of minimizing said cost function is performed by means of a multivariate Newton-Raphson algorithm.

7. The method of any of the claims 1 to 6, wherein said predictive filter model is defined by its synthesis equivalent being a parametric all-pole filter model, an autoregressive coefficients filter (ARC) model, a reflection coefficients filter (RC) model, and/or a line spectral frequencies (LSF) model.

8. The method of any of the claims 3 to 7, wherein said cost function additionally comprises a regularization term such that

with

B(ω) being a regularization spectrum referring to an a priori guess of spectral areas of the spectrum which lack information, and λ being a regularization constant, 0 < λ < 1.

9. The method of claim 8, wherein B(ω) ≪ X(ω) and λ ≪ 1.

10. The method of any of the claims 3 to 8, wherein in the computing step of minimizing said cost function an amended transfer function H̃(ω) is used comprising said gain level η merged into said transfer function H(ω).

Description:

A Method for Estimating Signal Coding Parameters

The present invention relates to an improved technique for encoding a digital signal, in particular a speech signal. More specifically, the invention concerns a method for estimating coding parameters of a predictive filter model of a digital signal according to the preamble of claim 1.

A widely used technique for speech coding is the so-called Linear Predictive Coding (LPC). Said technique computes the parameters of an autoregressive filter from the time samples of a digital speech signal. The computation of those parameters is well known to those of ordinary skill in the field of the present invention. An example of such computation is found in ITU-T Recommendation G.722.2, "Wideband coding of speech at around 16 kbit/s using adaptive multi-rate wideband (AMR-WB)", Geneva 2002. Most commercial speech coders, such as the LPC Vocoder, Code-Excited Linear Prediction (CELP) and its later variants (ACELP, VSELP), among many others, rely on the LPC technique. In general, the LPC technique is founded on the minimization of the energy of the prediction error e[n]

$$e[n] = x[n] - \sum_{m=1}^{M} a_m\, x[n-m] \qquad (1)$$

where x[n] is a windowed segment of the input digital signal, a_m are the linear prediction coefficients and M is the model order. In the later decoding, i.e. synthesis stage, the signal is re-generated on the basis of these coefficients input to the synthesis equivalent of the predictive filter model, which synthesis equivalent is defined by the transfer function

$$H(\omega) = \frac{1}{1 - \sum_{m=1}^{M} a_m\, e^{-j\omega m}} \qquad (2)$$

The energy of said prediction error can be formulated, by using Parseval's relation, in the frequency domain as a cost function

$$E = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{|X(\omega)|^2}{|H(\omega)|^2}\, d\omega \qquad (3)$$

with X(ω) being the spectral transformation of the signal segment x[n].

According to the mentioned equivalence between time and frequency, the solution delivered by the LPC technique is thus equivalent to the linear prediction coefficients that make cost function E minimal.
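The equivalence between the time-domain prediction error energy and the frequency-domain cost function E can be checked numerically. The following is a minimal Python sketch and not part of the disclosure: the toy segment, the coefficient values and the helper names (`prediction_error`, `dft`, `spectral_cost`) are assumptions chosen for illustration, the filter is assumed to have the usual all-pole form 1/(1 − Σ a_m e^{−jωm}), and the integral is replaced by a sum over a zero-padded DFT long enough to avoid circular wrap-around.

```python
import cmath
import math

def prediction_error(x, a):
    """e[n] = x[n] - sum_m a[m-1] * x[n-m], with x taken as zero outside the segment."""
    M, N = len(a), len(x)
    e = []
    for n in range(N + M):
        pred = sum(a[m - 1] * x[n - m] for m in range(1, M + 1) if 0 <= n - m < N)
        xn = x[n] if n < N else 0.0
        e.append(xn - pred)
    return e

def dft(x, K):
    """Naive K-point DFT of x (implicitly zero-padded to K samples)."""
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / K) for n in range(len(x)))
            for k in range(K)]

def spectral_cost(x, a, K):
    """Discrete counterpart of E: (1/K) * sum_k |X_k|^2 / |H_k|^2."""
    X = dft(x, K)
    total = 0.0
    for k in range(K):
        w = 2 * math.pi * k / K
        A = 1 - sum(a[m - 1] * cmath.exp(-1j * w * m) for m in range(1, len(a) + 1))
        total += abs(X[k]) ** 2 * abs(A) ** 2  # 1/|H|^2 equals |A|^2
    return total / K

x = [0.0, 0.8, 1.0, 0.3, -0.6, -0.9, -0.4, 0.5]  # toy windowed segment
a = [0.9, -0.5]                                   # toy coefficients, M = 2
time_energy = sum(v * v for v in prediction_error(x, a))
freq_energy = spectral_cost(x, a, K=len(x) + len(a))
```

Both quantities agree up to rounding error, so any coefficient set that minimizes one also minimizes the other.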

Speech coders based on the LPC technique are known to deliver coded speech of acceptable but moderate quality. Furthermore, the performance of automatic speech recognition systems drops notably when fed with said coded signal instead of the raw signal. The author of the present invention found out that although a predictive filter model is adequate for describing the physical production of speech, the LPC technique is unable to obtain the parameters of said model with enough accuracy.

It is therefore an object of the invention to determine coding parameters for digital signals, in particular speech signals, with improved accuracy.

This object is achieved by means of a method for estimating coding parameters of a predictive filter model of a digital signal, in particular speech signal, comprising: receiving a segment of said signal; computing the spectrum of said segment; estimating the background noise in said segment; and estimating the fundamental frequency in said segment; which method is characterized by the steps of: computing a spectral mask on the basis of said background noise and said fundamental frequency; and determining those coding parameters that substantially minimize a cost function which is based on said spectrum, said spectral mask and said predictive filter model.

In the present disclosure the term "substantially minimizing" is intended to comprise both making the cost function minimal and making the cost function at least a sufficiently low value, i.e. a value within a given or acceptable tolerance interval from that minimum.

Thus, the proposed coding of each signal segment comprises two main processing steps: on the one hand, the computation of a spectral mask that weights the relevance of each spectral sample of said segment spectrum, wherein the relevance is determined on the basis of the fundamental frequency of the speech utterance and the spectral characteristics of the noise in the segment, and on the other hand the computation of the coding parameters that make a specific cost function minimal, or at least bring it to an appropriate level, where said cost function is built with said segment spectrum, said spectral mask and the parametric filter model.

The invention is based on the insight that not all spectral samples in an input spectrum necessarily contain valuable information for the estimation of linear prediction coefficients: for instance, the spectrum of voiced speech utterances contains only valuable information at harmonic frequencies, and in the presence of background noise the spectrum of the speech can be corrupted at certain frequency components if its level is lower than that of the noise at said components.

In contrast to conventional LPC techniques which are severely affected by these effects, the novel frequency-selective approach of the invention increases the coding precision and efficiency, especially in the case of voiced utterances and/or of noise-corrupted signal segments. The method of the invention computes the speech coding parameters on the basis of the spectrum of the signal segment, where said parameters are related to the popular speech formation model, with significantly improved accuracy.

The method of the invention can replace e.g. the LPC technique in those speech/audio coders that operate with said technique. The invention can also be used in speech/audio coders that do not operate with said LPC technique, such as Harmonic Coders and Hybrid Coders.

Apart therefrom, the improved accuracy of the estimation of the coding parameters also implies a more accurate estimation of the spectral energy. Thus, the present invention can be used in automatic speech recognition systems also in such a way that spectral-like features can be drawn directly from the estimated filter model and gain level instead of from the signal spectrum.
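By way of illustration only, spectral-like features could be sampled from an estimated model as sketched below. The sketch assumes that η acts as an amplitude gain, so that the model power spectrum is η²|H(ω)|², and that H(ω) is the all-pole synthesis filter 1/(1 − Σ a_m e^{−jωm}). The function name `log_envelope_db`, the decibel scale, the frequency grid and the parameter values are hypothetical and not taken from the disclosure.

```python
import cmath
import math

def log_envelope_db(eta, a, num_points):
    """Sample 10*log10(eta^2 * |H(w)|^2) at num_points frequencies in [0, pi]."""
    feats = []
    for i in range(num_points):
        w = math.pi * i / (num_points - 1)
        A = 1 - sum(a[m - 1] * cmath.exp(-1j * w * m) for m in range(1, len(a) + 1))
        feats.append(10 * math.log10(eta ** 2 / abs(A) ** 2))
    return feats

# Hypothetical parameters of a stable second-order model.
features = log_envelope_db(eta=2.0, a=[1.2, -0.7], num_points=9)
```

Features of this kind depend only on the gain level and the filter coefficients, not on the individual spectral samples of the segment.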

In a preferred embodiment of the invention said coding parameters are the gain level and the filter coefficients of said predictive filter model.

In a particular preferred embodiment said cost function is

with X(ω) being said spectrum, α(ω) being said spectral mask, η being said gain level, and H(ω) being the transfer function, based on said filter coefficients, of the synthesis equivalent of the predictive filter model.

This new approach of cost function minimization is on the one hand a processing task which is readily feasible with state-of-the-art processing hardware and/or software, and on the other hand ensures the estimation of coding parameters with remarkably improved accuracy.

According to a further preferred feature of the invention said spectral mask is chosen as

$$\alpha(\omega) = p(\omega) \sum_{k} \delta(\omega - k\omega_0)$$

with ω₀ being said fundamental frequency and p(ω) being a noise mask based on said background noise. In particular, said noise mask is preferably

with X(ω) being said spectrum and N(ω) being the power spectrum of said background noise.

The step of minimizing said cost function can be performed by means of any suitable algorithm of the art; preferably, a multivariate Newton-Raphson algorithm is used.

In general, the coding parameters determined according to the present invention can be translated into any parameterization that is needed for the subsequent decoding, i.e. synthesis stage. Particularly, it is preferred that said predictive filter model is defined by its synthesis equivalent being a parametric all-pole filter model, an autoregressive coefficients filter (ARC) model, a reflection coefficients filter (RC) model, and/or a line spectral frequencies (LSF) model using the coding parameters determined.

Further details and advantages of the invention will become apparent from the appended claims and the following detailed description of a preferred embodiment under reference to the enclosed drawings in which

Fig. 1 illustrates the analysis stage of a simplified generic speech/audio coder containing the method for computing the parameters of the speech production model in accordance with the present invention, and

Figs. 2a-d show the superior performance obtained with the present invention in an example scenario.

From the field of bioacoustics it is known that the biological hearing sense responds to the logarithm of the sound intensity. The invention is based on the insight that this bioacoustic principle of logarithmic sense can be introduced into a maximum-likelihood (ML) correspondence according to equations (1) to (3) between the spectral samples X(ω) and the synthesis part H(ω) of the prediction filter model, resulting in

where ε(ω) is the spectral residue defined as

α(ω) being a spectral mask and P[·] denoting the probability density function (PDF).

Given that the spectral samples X(ω) are commonly characterized by a Gaussian random variable, the PDF of the logarithmic residual is

$$P[\varepsilon] = \exp(\varepsilon - \exp\varepsilon) \qquad (6)$$
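Equation (6) can be verified with a change of variables. The short derivation below is a sketch under an assumption not spelled out in the text: the power of a spectral sample, normalized by its expected value, is taken to follow a unit-mean exponential distribution (as it does for a zero-mean circular complex Gaussian sample), and ε is taken to be its natural logarithm. The symbol z is introduced here for illustration only.

```latex
p_z(z) = e^{-z},\quad z \ge 0, \qquad \varepsilon = \ln z
\;\Longrightarrow\;
P[\varepsilon] = p_z\!\left(e^{\varepsilon}\right)\left|\frac{dz}{d\varepsilon}\right|
             = e^{-e^{\varepsilon}}\, e^{\varepsilon}
             = \exp(\varepsilon - \exp\varepsilon),
\qquad
-\ln P[\varepsilon] = \exp\varepsilon - \varepsilon \;\ge\; 1 .
```

The negative log-likelihood exp ε − ε attains its minimum at ε = 0, i.e. where the model matches the spectral sample.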

According to the maximum likelihood criterion (4), the ML functional can now be set up as the following cost function:

The spectral mask α(ω) plays a vital role in the cost function ML in that it contains for each frequency a value that weights the relevance of the spectral sample at said frequency. The gain level η and the parameters a_m that define the synthesis filter H(ω) correspond to the parametric degrees of freedom of the cost function ML. As will be apparent to one skilled in the art, any reference in this disclosure to the cost function ML also comprises any mathematically or technically equivalent expression of equation (7), e.g. a cost function differing from equation (7) in an additive term that does not depend on said parametric degrees of freedom.
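The behaviour of such a masked cost function can be illustrated with a small Python sketch. Since equations (5) and (7) are not reproduced in this text, the sketch assumes one plausible form that follows from the PDF of equation (6): the residual is taken as ε(ω) = ln(|X(ω)|² / (η²|H(ω)|²)) and the cost as the masked sum of exp ε − ε over discrete frequency bins. The use of the power spectrum, η as an amplitude gain, the function names and all numeric values are assumptions, and the original equations may use a different but equivalent convention.

```python
import cmath
import math

def synthesis_power(a, w):
    """|H(w)|^2 for the all-pole filter H(w) = 1 / (1 - sum_m a_m e^{-jwm})."""
    A = 1 - sum(a[m - 1] * cmath.exp(-1j * w * m) for m in range(1, len(a) + 1))
    return 1.0 / abs(A) ** 2

def ml_cost(power, mask, freqs, eta, a):
    """Masked sum of exp(eps) - eps, eps = ln(|X|^2 / (eta^2 |H|^2)) (assumed form)."""
    cost = 0.0
    for P, alpha, w in zip(power, mask, freqs):
        if alpha == 0.0:
            continue  # masked-out bins do not contribute
        eps = math.log(P / (eta ** 2 * synthesis_power(a, w)))
        cost += alpha * (math.exp(eps) - eps)
    return cost

K = 64
freqs = [2 * math.pi * k / K for k in range(K)]
true_eta, true_a = 2.0, [1.2, -0.7]
# Noise-free toy "spectrum" that follows the model envelope exactly.
power = [true_eta ** 2 * synthesis_power(true_a, w) for w in freqs]
mask = [1.0 if k % 4 == 0 else 0.0 for k in range(K)]  # keep every 4th bin only

matched = ml_cost(power, mask, freqs, true_eta, true_a)
mismatched = ml_cost(power, mask, freqs, 1.5, [1.0, -0.5])
```

With the matching parameters every retained bin contributes exactly its minimum value of 1, whereas any mismatch on a retained bin raises the cost; bins with a zero mask value have no influence at all.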

Fig. 1 shows in the form of a block diagram an analysis stage 100 of a speech coder that uses the method of the present invention. A signal segmentation block 10 performs the usual segmentation of an input digital signal x into segments, generally denoted by x[n]. A spectral transformation block 20 performs the spectral transformation of said segment. Block 20 performs e.g. a Discrete Fourier Transform, Discrete Sine Transform and/or a Fan-Chirp Transform, among other popular choices.

A spectral mask block 30 performs the computation of the spectral mask α(ω). The segment x[n] is assumed to be corrupted by background noise whose spectral characteristics are described by the power spectrum N(ω). Furthermore said segment may contain a speech utterance of "voiced" nature, with fundamental frequency ω₀ (in case of "unvoiced" speech utterances, the fundamental frequency is considered zero or very low). Therefore, by making use of the frequency-selective properties of cost function ML, the spectral mask is computed by block 30 as

$$\alpha(\omega) = p(\omega) \sum_{k} \delta(\omega - k\omega_0)$$

where δ(ω) is the "Dirac delta" function, and p(ω) is the noise mask computed as

The goal of said spectral mask is two-fold: on the one hand to disable those spectral samples of the segment spectrum that are appreciably corrupted by noise, and on the other hand to discard the spectral samples that do not correspond to harmonic frequencies. Said harmonic frequencies point to the high-energy spectral peaks that delineate the spectral envelope of the speech utterance. The estimation of the power spectrum N(ω) is carried out by a noise estimation block 35 according to known ad-hoc techniques, such as a Kalman filter estimation, etc. The estimation of the fundamental frequency ω₀ is carried out by a pitch analysis block 40 according to known ad-hoc methods, e.g. peak detection of the autocorrelation of the segment, etc.
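On a discrete frequency grid, the computation performed by block 30 might be sketched as follows. This is an illustrative Python sketch under several assumptions that are not taken from the disclosure: the Dirac comb is approximated by selecting the bin nearest to each harmonic k·ω₀ with k ≥ 1, an unvoiced segment (ω₀ zero or below the bin spacing) keeps every bin, and the noise mask is a binary rule that keeps a bin only where the segment power exceeds the noise power. The noise mask formula itself is not reproduced in this text, so the binary rule is a stand-in, and all function names are hypothetical.

```python
import math

def harmonic_comb(num_bins, w0):
    """1.0 at the bin nearest each harmonic k*w0 (k >= 1) up to pi, else 0.0.

    Bins cover [0, pi] uniformly. If w0 is zero or below the bin spacing
    (unvoiced case), every bin is kept.
    """
    step = math.pi / (num_bins - 1)
    if w0 <= step:
        return [1.0] * num_bins
    comb = [0.0] * num_bins
    k = 1
    while k * w0 <= math.pi:
        comb[round(k * w0 / step)] = 1.0
        k += 1
    return comb

def noise_mask(power, noise_power):
    """Assumed binary rule: keep a bin only if signal power exceeds noise power."""
    return [1.0 if P > N else 0.0 for P, N in zip(power, noise_power)]

def spectral_mask(power, noise_power, w0):
    """Product of the noise mask and the harmonic comb, bin by bin."""
    comb = harmonic_comb(len(power), w0)
    p = noise_mask(power, noise_power)
    return [pv * cv for pv, cv in zip(p, comb)]

num_bins = 17                      # bins at 0, pi/16, ..., pi
power = [1.0] * num_bins           # toy segment power spectrum |X|^2
noise = [0.5] * num_bins           # toy noise power spectrum N
noise[8] = 2.0                     # noise dominates the second harmonic
alpha = spectral_mask(power, noise, w0=math.pi / 4)
kept = [i for i, v in enumerate(alpha) if v > 0]
```

In this toy case only harmonic bins that are not dominated by noise survive, which is the two-fold selection described above.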

A cost function minimization block 50 carries out the computation of the gain level η and parameters a_m as coding parameters of the filter model that make cost function ML minimal, or at least below a predetermined level. This minimization task is a readily feasible computer programming task. A possible choice for the implementation of the minimization task is the multivariate Newton-Raphson algorithm.
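A multivariate Newton-Raphson iteration for a cost function of this kind might look as follows. The Python sketch is not the implementation of block 50 and rests on assumptions: the cost is taken as the masked sum of P|B|² − ln|B|² with P = |X(ω)|², which equals the masked sum of exp ε − ε with ε = ln(|X|²/(η²|H|²)) up to a term independent of the parameters; the parameter vector b = (1/η, −a₁/η, ..., −a_M/η) is a reparameterization chosen here for convenience; and the plain Newton step is safeguarded by step halving and a gradient fallback, which the text does not prescribe. All names and test values are hypothetical.

```python
import cmath
import math

def cost_grad_hess(b, power, mask, freqs):
    """J(b) = sum_k alpha_k (P_k |B_k|^2 - ln |B_k|^2), with gradient and Hessian.

    B_k = sum_m b[m] e^{-j w_k m}, where b = (1/eta, -a_1/eta, ..., -a_M/eta).
    """
    n = len(b)
    J, g = 0.0, [0.0] * n
    H = [[0.0] * n for _ in range(n)]
    for P, alpha, w in zip(power, mask, freqs):
        if alpha == 0.0:
            continue  # masked-out bins carry no information
        c = [cmath.exp(-1j * w * m) for m in range(n)]
        B = sum(bm * cm for bm, cm in zip(b, c))
        J += alpha * (P * abs(B) ** 2 - math.log(abs(B) ** 2))
        for i in range(n):
            g[i] += 2 * alpha * (P * c[i] * B.conjugate() - c[i] / B).real
            for j in range(n):
                H[i][j] += 2 * alpha * (P * c[i] * c[j].conjugate()
                                        + c[i] * c[j] / B ** 2).real
    return J, g, H

def solve(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    T = [row[:] + [v] for row, v in zip(A, y)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(T[r][col]))
        T[col], T[piv] = T[piv], T[col]
        for r in range(col + 1, n):
            f = T[r][col] / T[col][col]
            for q in range(col, n + 1):
                T[r][q] -= f * T[col][q]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (T[r][n] - sum(T[r][q] * x[q] for q in range(r + 1, n))) / T[r][r]
    return x

def newton_raphson(b, power, mask, freqs, iters=50, tol=1e-10):
    """Newton-Raphson steps, safeguarded by step halving and a gradient fallback."""
    J, g, H = cost_grad_hess(b, power, mask, freqs)
    for _ in range(iters):
        if max(abs(v) for v in g) < tol:
            break
        step = solve(H, [-v for v in g])
        if sum(gi * si for gi, si in zip(g, step)) >= 0:
            step = [-gi for gi in g]  # not a descent direction: fall back
        t, improved = 1.0, False
        while t > 1e-12 and not improved:
            trial = [bi + t * si for bi, si in zip(b, step)]
            Jt, gt, Ht = cost_grad_hess(trial, power, mask, freqs)
            if Jt < J:
                b, J, g, H, improved = trial, Jt, gt, Ht, True
            t *= 0.5
        if not improved:
            break
    return b, J

K = 64
freqs = [2 * math.pi * k / K for k in range(K)]
true_b = [0.5, -0.6, 0.35]  # corresponds to eta = 2, a = [1.2, -0.7]
power = [1.0 / abs(sum(bm * cmath.exp(-1j * w * m) for m, bm in enumerate(true_b))) ** 2
         for w in freqs]
mask = [1.0 if k % 4 == 0 else 0.0 for k in range(K)]
start = [1.0, 0.0, 0.0]     # eta = 1, no prediction
J_start = cost_grad_hess(start, power, mask, freqs)[0]
b_hat, J_hat = newton_raphson(start, power, mask, freqs)
eta_hat = 1.0 / b_hat[0]
a_hat = [-bm / b_hat[0] for bm in b_hat[1:]]
```

The gain level and filter coefficients are recovered from the final vector as η = 1/b₀ and a_m = −b_m/b₀. Because the iteration only accepts steps that lower the cost, it can be stopped early once the cost falls below a predetermined level.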

The output parameters of the speech coder analysis stage 100 are the gain level η, the parameters a_m of the predictive filter, and, if desired, the pitch of the excitation ω₀ which can be taken from the output of block 40. Said parameters correspond to the output of the analysis stage of conventional speech coders e.g. relying on the LPC technique. Therefore, the method of the present invention can supersede the LPC technique in said coders.

Although all processors of the analysis stage 100 operate with time-discrete and frequency-discrete samples, for the sake of clarity the mathematical description of the invention has been given in continuous frequency. One skilled in the art will immediately recognize that this choice does not affect the essence of the present invention.

Figs. 2a-d illustrate the frequency-selective properties of the present invention on a segment of voiced speech:

Fig. 2a shows an exemplary input signal segment x[n] of 200 samples;

Fig. 2b depicts the logarithmic spectrum envelope obtained with conventional LPC (dotted line) vs. the envelope obtained with the method of the invention (solid line);

Fig. 2c shows the prediction error e[n] with the inventive method; and

Fig. 2d shows the prediction error e[n] with the conventional LPC technique.

It can be clearly seen that the present invention achieves higher accuracy in estimating the coding parameters of a predictive filter model, manifested by a resulting spectral envelope interpolating narrowly, i.e. matching closely, the energy of the harmonics, see Fig. 2b, and a prediction error closer to the actual excitation, see Fig. 2c.

The present description contained specific information pertaining both to the scientific basis and the implementation of the present invention. One skilled in the art will recognize that the present invention may be implemented in a manner different from that specifically discussed in the present application. The proposed method can e.g. be implemented and realized efficiently in a digital computer.

In particular, for practical implementation purposes the following variants of the invention have proven to yield superior and stable results even under varying conditions of input signals.

In a first preferred variant, to avoid numerical instabilities of the algorithm when computing the minimum of the cost function ML, the cost function can optionally be provided with an additive "regularization" term according to equation (10), with B(ω) being a regularization spectrum referring to an "a priori guess" of spectral areas of the spectrum which lack information, and λ being a positive regularization constant smaller than one (0 < λ < 1) which determines how weak or strong the regularization is made.

Preferably, the regularization spectrum B(ω) is set to a low value (B(ω) ≪ X(ω)) or to the solution from previous segments of the digital signal; and the regularization constant λ is positive and much smaller than one (λ ≪ 1), e.g. λ < 0.2, preferably λ = 0.1.

In this way, undesired numerical instabilities due to large zero-valued areas of the spectral mask α(ω), e.g. because of large noise and/or large values of the fundamental frequency, can be avoided in practical implementations of the method of the invention.

In a further preferred variant, when computing the minimum of the cost function ML optionally an amended transfer function

$$\tilde{H}(\omega) = \frac{1}{1 - \sum_{m=1}^{M} \tilde{a}_m\, e^{-j\omega m}}$$

can be used into which the gain level η has been merged to define "normalized" linear prediction coefficients ã_m such that minimizing the cost function

is equivalent to minimizing equation (7), or minimizing

is equivalent to minimizing equation (10), respectively, yielding excellent numerical and convergence properties in practical implementations.

The invention is not limited to the preferred embodiments described in detail above but encompasses all variants and modifications thereof which will become apparent for the man skilled in the art from the present disclosure and which fall into the scope of the appended claims.
