Title:
SYSTEM AND METHOD FOR SPEAKER RECOGNITION
Document Type and Number:
WIPO Patent Application WO/2008/028511
Kind Code:
A1
Abstract:
The present invention relates to a system and method for speaker recognition. In order to improve speaker recognition in noisy environments the proposed method comprises the step of filtering frames (37) of time domain speech samples of an input utterance of noisy speech with a hypothesized Wiener filter, wherein for the hypothesized Wiener filter the power spectral density (39) of a frame (32) of the speaker specific reference template (31) is used as a signal estimate and the power spectral density (40) of a frame (41) of time domain noise sample of a non-speech region of the input utterance of noisy speech is used as a noise estimate, the output (42) of the hypothesized Wiener filter being cleaned frames (43) of time domain speech samples of the input utterance of noisy speech.

Inventors:
RAMASUBRAMANIAN, Viswanathan (221 Devasandra Layout, K.R. Puram, Bangalore 6, 560 03, IN)
VIJAYWARGIAY, Deepak (Sri Brij Kunj 3-2-761 Kachiguda, Hiderabad 7, 50002, IN)
Application Number:
EP2006/008786
Publication Date:
March 13, 2008
Filing Date:
September 08, 2006
Assignee:
SIEMENS AKTIENGESELLSCHAFT (Wittelsbacherplatz 2, München, 80333, DE)
RAMASUBRAMANIAN, Viswanathan (221 Devasandra Layout, K.R. Puram, Bangalore 6, 560 03, IN)
VIJAYWARGIAY, Deepak (Sri Brij Kunj 3-2-761 Kachiguda, Hiderabad 7, 50002, IN)
International Classes:
G10L17/00
Attorney, Agent or Firm:
SIEMENS AKTIENGESELLSCHAFT (Postfach 22 16 34, München, 80506, DE)
Claims:

Patent claims

1. System for speaker recognition comprising:

- means (30) for storing at least one speaker specific reference template (31) comprising a sequence of frames (32) of time domain speech samples from a clean speech utterance of a password text of the respective speaker (33),

- means (34, 35) for recording a sequence (36) of frames (37) of time domain speech samples of an input utterance of noisy speech of the password text,

- means (38) for filtering the frames (37) of time domain speech samples of the input utterance of noisy speech with a hypothesized Wiener filter, wherein for the hypothesized Wiener filter the power spectral density (39) of a frame (32) of the speaker specific reference template (31) is used as a signal estimate and the power spectral density (40) of a frame (41) of time domain noise sample of a non-speech region of the input utterance of noisy speech is used as a noise estimate, the output (42) of the hypothesized Wiener filter being cleaned frames (43) of time domain speech samples of the input utterance of noisy speech,

- feature extraction means (44) for converting a frame (32, 43) of time domain speech samples into a corresponding feature vector (47, 46),

- means (45) for determining the respective local distance between a feature vector (46) of a cleaned frame (43) and the feature vector (47) of the respective frame (32) of the speaker specific reference template (31) whose power spectral density (39) was used as a signal estimate for the filtering to get the cleaned frame (43), and for assigning the respective local distance to the respective pair of frame (37) of time domain speech samples of the input utterance of noisy speech and frame (32) of the speaker specific reference template (31),

- a dynamic time warping module (48) for determining a warping path (49) for an optimized time alignment between a sequence of feature vectors of the input utterance of noisy speech and a sequence of feature vectors of the clean speech utterance of the respective speaker, minimizing the accumulation (50) of the respective assigned local distances over the warping path (49).

2. System according to claim 1, where the means (30) for storing at least one speaker specific reference template (31) are provided for storing templates for several password texts.

3. System according to claim 1 or 2, where the means (30) for storing at least one speaker specific reference template (31) are provided for storing a set of speaker specific reference templates for each password text.

4. System according to any of the preceding claims, where a one-pass dynamic programming framework is provided for matching an input feature vector sequence against reference templates for several password texts, using multiple templates of each password text.

5. System according to any of the preceding claims, where the means (45) for determining the respective local distance are provided for determining a grid of the respective local distances $d_w(i,j)$, $i = 1,\dots,T_x$, $j = 1,\dots,T_y$, where each column $i$ contains the local distances between the clean frames $y_j$, $j = 1,\dots,T_y$, and the corresponding cleaned frames $x_{ij}$, $j = 1,\dots,T_y$.

6. System according to any of the preceding claims, where the power spectral density (40) of a frame (41) of time domain noise sample of the most recent non-speech region of the input utterance of noisy speech is used as a noise estimate.

7. System according to any of the preceding claims, where the feature extraction means (44) use mel-frequency-cepstral coefficients.

8. System according to any of the preceding claims, where speaker identification means are provided for comparing respective minimized accumulations (50) of several speakers (33) and identifying the speaker with a lowest accumulation.

9. System according to any of the preceding claims, where means for speaker verification are provided for normalizing the accumulation (50) of a speaker (33) by a background speaker's score, computed between the input utterance and the respective background speaker specific reference templates (31) and for comparing the normalized score to a threshold.

10. System according to any of the preceding claims, where the system is for use in a secure access control system.

11. Method for speaker recognition comprising the following steps:

- storing at least one speaker specific reference template (31) comprising a sequence of frames (32) of time domain speech samples from a clean speech utterance of a password text of the respective speaker (33),

- recording a sequence (36) of frames (37) of time domain speech samples of an input utterance of noisy speech of the password text,

- filtering the frames (37) of time domain speech samples of the input utterance of noisy speech with a hypothesized Wiener filter, wherein for the hypothesized Wiener filter the power spectral density (39) of a frame (32) of the speaker specific reference template (31) is used as a signal estimate and the power spectral density (40) of a frame (41) of time domain noise sample of a non-speech region of the input utterance of noisy speech is used as a noise estimate, the output (42) of the hypothesized Wiener filter being cleaned frames (43) of time domain speech samples of the input utterance of noisy speech,

- converting a frame (32, 43) of time domain speech samples into a corresponding feature vector (47, 46),

- determining the respective local distance between a feature vector (46) of a cleaned frame (43) and the feature vector (47) of the respective frame (32) of the speaker specific reference template (31) whose power spectral density (39) was used as a signal estimate for the filtering to get the cleaned frame (43),

- assigning the respective local distance to the respective pair of frame (37) of time domain speech samples of the input utterance of noisy speech and frame (32) of the speaker specific reference template (31),

- determining a warping path (49) for an optimized time alignment between a sequence of feature vectors of the input utterance of noisy speech and a sequence of feature vectors of the clean speech utterance of the respective speaker, and

- minimizing the accumulation (50) of the respective assigned local distances over the warping path (49).

12. Method according to claim 11, comprising the step of storing speaker specific reference templates (31) for several password texts.

13. Method according to claim 11 or 12, comprising the step of storing a set of speaker specific reference templates (31) for each password text.

14. Method according to any of claims 11 to 13, comprising the step of matching the input feature vector sequence against reference templates for several password texts, using multiple templates of each password text, within a one-pass dynamic programming framework.

15. Method according to any of claims 11 to 14, comprising the step of determining a grid of the respective local distances $d_w(i,j)$, $i = 1,\dots,T_x$, $j = 1,\dots,T_y$, where each column $i$ contains the local distances between the clean frames $y_j$, $j = 1,\dots,T_y$, and the corresponding cleaned frames $x_{ij}$, $j = 1,\dots,T_y$.

16. Method according to any of claims 11 to 15, where the power spectral density (40) of a frame (41) of time domain noise sample of the most recent non-speech region of the input utterance of noisy speech is used as a noise estimate.

17. Method according to any of claims 11 to 16, where for converting a frame (32, 43) of time domain speech samples into a corresponding feature vector (47, 46) mel-frequency-cepstral coefficients are used.

18. Method according to any of claims 11 to 17, comprising the step of comparing respective minimized accumulations (50) of several speakers (33) and identifying the speaker with a lowest accumulation.

19. Method according to any of claims 11 to 18, comprising the steps of normalizing the accumulation (50) of a speaker (33) by a background speaker's score, computed between the input utterance and the respective background speaker specific reference templates (31) and of comparing the normalized score to a threshold.

20. Method according to any of claims 11 to 19, applied in a secure access control system.

Description:

Description

System and method for speaker recognition

The present invention relates to a system and method for speaker recognition.

Robustness of a speaker-recognition system to additive background noise is particularly important when the system needs to operate in noisy environments. This is an even more challenging task when the system has to perform recognition in a noisy environment different from that of training. For instance, in text-dependent speaker-recognition, training involves using word models (or sub-word models) in the form of templates or hidden Markov models (HMMs). If these templates or HMMs are obtained in a clean environment and tested in a noisy environment, it is well known that the recognition accuracy suffers significantly due to the inherent mismatch between the "clean" models and the "noisy" test speech data.

The article A. D. Berstein, I. D. Shallom, "An hypothesized wiener filtering approach to noisy speech recognition", Proc. ICASSP '91, pages 913-916, 1991, proposes a hypothesized Wiener filtering approach to noisy speech recognition.

The object of the present invention is to improve speaker recognition in noisy environments.

This object is achieved by a system for speaker recognition comprising:

- means for storing at least one speaker specific reference template comprising a sequence of frames of time domain speech samples from a clean speech utterance of a password text of the respective speaker,

- means for recording a sequence of frames of time domain speech samples of an input utterance of noisy speech of the password text,

- means for filtering the frames of time domain speech samples of the input utterance of noisy speech with a hypothesized Wiener filter, wherein for the hypothesized Wiener filter the power spectral density of a frame of the speaker specific reference template is used as a signal estimate and the power spectral density of a frame of time domain noise sample of a non-speech region of the input utterance of noisy speech is used as a noise estimate, the output of the hypothesized Wiener filter being cleaned frames of time domain speech samples of the input utterance of noisy speech,

- feature extraction means for converting a frame of time domain speech samples into a corresponding feature vector,

- means for determining the respective local distance between a feature vector of a cleaned frame and the feature vector of the respective frame of the speaker specific reference template whose power spectral density was used as a signal estimate for the filtering to get the cleaned frame, and for assigning the respective local distance to the respective pair of frame of time domain speech samples of the input utterance of noisy speech and frame of the speaker specific reference template,

- a dynamic time warping module for determining a warping path for an optimized time alignment between a sequence of feature vectors of the input utterance of noisy speech and a sequence of feature vectors of the clean speech utterance of the respective speaker, minimizing the accumulation of the respective assigned local distances over the warping path.

This object is achieved by a method for speaker recognition, comprising the following steps:

- storing at least one speaker specific reference template comprising a sequence of frames of time domain speech samples from a clean speech utterance of a password text of the respective speaker,

- recording a sequence of frames of time domain speech samples of an input utterance of noisy speech of the password text,

- filtering the frames of time domain speech samples of the input utterance of noisy speech with a hypothesized Wiener filter, wherein for the hypothesized Wiener filter the power spectral density of a frame of the speaker specific reference template is used as a signal estimate and the power spectral density of a frame of time domain noise sample of a non-speech region of the input utterance of noisy speech is used as a noise estimate, the output of the hypothesized Wiener filter being cleaned frames of time domain speech samples of the input utterance of noisy speech,

- converting a frame of time domain speech samples into a corresponding feature vector,

- determining the respective local distance between a feature vector of a cleaned frame and the feature vector of the respective frame of the speaker specific reference template whose power spectral density was used as a signal estimate for the filtering to get the cleaned frame,

- assigning the respective local distance to the respective pair of frame of time domain speech samples of the input utterance of noisy speech and frame of the speaker specific reference template,

- determining a warping path for an optimized time alignment between a sequence of feature vectors of the input utterance of noisy speech and a sequence of feature vectors of the clean speech utterance of the respective speaker, and

- minimizing the accumulation of the respective assigned local distances over the warping path.

The underlying idea of the invention is to make use of the hypothesized Wiener filtering (HWF) approach for realizing noise robust text-dependent speaker recognition. The text-dependent speaker-recognition problem represents a unique and ideal setting for deriving an advantage with the HWF algorithm, because "clean reference templates" of the words of the password text, which is supposed to be the text of the input noisy speech, are available. The proposed HWF approach exploits this effectively for robust speaker-recognition with high recognition accuracies for both additive white noise and non-stationary color noise conditions.

In a further embodiment of the invention, the means for storing at least one speaker specific reference template are provided for storing templates for several password texts. This increases security, since different password texts can be used and passwords can be changed accordingly. To this end, in a text-dependent speaker recognition system, the system is trained during training with some passwords (speech signal segments) and the corresponding text information (information about what the password is).

To increase security and the accuracy of speaker recognition according to a further embodiment of the invention the means for storing at least one speaker specific reference template are provided for storing a set of speaker specific reference templates for each password text.

According to an especially preferred further development of the invention a one-pass dynamic programming framework is provided for matching the input feature vector sequence against reference templates for several password texts, using multiple templates of each password text. This increases the accuracy of speaker recognition. It makes the invention useful for variable-text text-dependent speaker-recognition with multiple templates, which results in improved performance (by handling intra-speaker variability adequately), and for input speech that is continuous with arbitrary inter-word pauses.

To ease the determining and optimizing of the warping path according to a further embodiment of the invention the means for determining the respective local distance are provided for determining a grid of the respective local distances $d_w(i,j)$, $i = 1,\dots,T_x$, $j = 1,\dots,T_y$, where each column $i$ contains the local distances between the clean frames $y_j$, $j = 1,\dots,T_y$, and the corresponding cleaned frames $x_{ij}$, $j = 1,\dots,T_y$.

In order to improve the noise estimation according to a preferred further development of the invention the power spectral density of a frame of time domain noise sample of the most recent non-speech region of the input utterance of noisy speech is used as a noise estimate.

In a further embodiment of the invention the feature extraction means use mel-frequency-cepstral coefficients. This provides for a particularly effective extraction of features.

According to an especially preferred further development of the invention speaker identification means are provided for comparing respective minimized accumulations of several speakers and identifying the speaker with a lowest accumulation. In this way the speaker recognition system can be used by several speakers and the identification of any particular speaker is improved.

In order to improve authentication of any particular speaker according to a preferred further development of the invention means for speaker verification are provided for normalizing the accumulation of a speaker by a background speaker's score, computed between the input utterance and the respective background speaker specific reference templates and for comparing the normalized score to a threshold.

According to a further embodiment of the invention the system is for use in a secure access control system, in particular in a secure access control system using voice under additive background noise in buildings, cars, offices, shop factories, or over the telephone in a variable-text text-dependent speaker-recognition mode where the user (or system) can change access passwords from time to time or in prompted-mode of speaker-recognition operation where the system uses randomly generated passwords every time the system is used for high security applications.

The present invention is further described hereinafter with reference to preferred embodiments shown in the accompanying drawings, in which:

FIG 1 shows the system architecture of a variable-text speaker-recognition system based on a one-pass dynamic programming matching algorithm,

FIG 2 shows a simplified illustration of the one-pass dynamic programming matching,

FIG 3 shows a schematic overview of an embodiment of the invention,

FIG 4, FIG 5 and FIG 6 show local distance matrixes between noisy input speech and clean templates,

FIG 7 and FIG 8 show closed-set speaker-identification accuracy using the one-pass dynamic programming algorithm,

FIG 9 shows the speaker-verification performance using the detection error trade-off curve,

FIG 10 shows the one-pass dynamic programming matching between test utterance and multiple training templates corresponding to password text,

FIG 11 shows the one-pass dynamic programming matching with optional inter-word silences,

FIG 12 shows two types of recursions for word-templates and

FIG 13 shows recursions for an inter-word silence template.

In a text-dependent speaker recognition system, during the training, the system is usually trained with some passwords (speech signal segments) and the corresponding "text" information (information about what the password is) uttered by the various speakers. During actual usage of the system, given the uttered password and the corresponding text information, the system attempts to identify the speaker.

Such systems typically have a front-end "feature extractor" which extracts some features from the given speech signal segments at the input. A "back-end" classifier then uses these extracted features either to train the system (during the training phase) or to enable the system to detect the speaker (during actual use).

For such back-end classification, conventional text-dependent speaker recognition systems use either the Hidden Markov Model (HMM) based methods or the Dynamic Time Warping (DTW) method. DTW is described in L. R. Rabiner, B. H. Juang "Fundamentals of Speech Recognition", Prentice Hall, 1993. It is essentially a method in which two segments of signals, which are of different time duration, can be compared together with one signal being suitably time-warped and aligned to aid the comparison process. This way two signal segments can be compared even though they are not time-aligned or they have different length.
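As a rough illustration of the time-warped comparison described above, the following minimal sketch computes the accumulated distance between two feature-vector sequences of different length. The Euclidean local distance, the three-predecessor step pattern and the function names are assumptions made for this example; the text does not prescribe them.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Local distance between two feature vectors (assumed Euclidean)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(x: List[Sequence[float]], y: List[Sequence[float]]) -> float:
    """Accumulated distance along the cheapest warping path between x and y."""
    tx, ty = len(x), len(y)
    inf = float("inf")
    # acc[i][j] holds the cheapest accumulated distance for aligning
    # the first i vectors of x with the first j vectors of y.
    acc = [[inf] * (ty + 1) for _ in range(tx + 1)]
    acc[0][0] = 0.0
    for i in range(1, tx + 1):
        for j in range(1, ty + 1):
            d = euclidean(x[i - 1], y[j - 1])
            # Assumed step pattern: advance in x, in y, or in both.
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[tx][ty]


# The second sequence lingers on the middle vector, yet aligns perfectly.
print(dtw_distance([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]]))
```

Because one sequence may be stretched against the other, the two inputs need not have the same number of frames.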

Conventional speaker identification systems try to cope with background noisy conditions in mainly two ways: a) Front-End Noise Cancellation and b) Modification of Extracted Features. Front-end noise cancellation essentially attempts to enhance the noisy speech with spectral subtraction

A conventional approach to dealing with noisy speech in applications such as speech recognition, text-independent speaker-recognition and speech coding is to apply noise-removal techniques such as spectral subtraction or conventional Wiener filtering methods so as to get an enhanced speech signal prior to feature extraction. This often introduces artifacts (in the form of ringing and musical noise). While spectral subtraction requires an estimate of the noise power spectral densities, typically from the most recent non-speech region, Wiener filtering methods require estimates of both the clean speech power spectrum and the noise power spectrum. There are a wide variety of Wiener filtering techniques depending on how the clean speech power spectrum estimate is obtained for any given noisy frame. These can be broadly categorized as based on spectral subtraction, on estimates of the signal spectrum from previous "cleaned" frames, on model-based estimates such as linear prediction, or on vector quantizer codebooks. These methods are typically employed in an iterative framework.
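For orientation, power spectral subtraction can be sketched as follows. This is a simplified illustration, not the procedure of any cited work: the function name, the per-bin list representation and the flooring of negative results are assumptions made for the example.

```python
from typing import List


def spectral_subtraction(noisy_psd: List[float],
                         noise_psd: List[float],
                         floor: float = 0.0) -> List[float]:
    """Subtract a noise power estimate from a noisy power spectrum, bin by bin.

    Negative differences are clamped to `floor` (an assumed choice); this
    clamping leaves isolated spectral peaks, which is one commonly cited
    source of musical noise.
    """
    if len(noisy_psd) != len(noise_psd):
        raise ValueError("spectra must have the same number of bins")
    return [max(p - n, floor) for p, n in zip(noisy_psd, noise_psd)]


# Noise estimate taken, for example, from a non-speech frame.
print(spectral_subtraction([4.0, 1.0, 3.0], [1.0, 2.0, 3.0]))
```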

Noise removal becomes more difficult in case of non-stationary noise, where it is difficult to use spectral subtraction techniques that rely on noise-estimates obtained during non-speech regions, since the non-speech regions could be unreliable to detect or not present at all, or present so infrequently that the noise-estimate no longer matches with the time-varying noise characteristics in the speech regions. This is typically the case for speaker-recognition applications such as access control to buildings, cars, offices etc. or speaker-authentication over telephones or mobiles (prior to secure tele-transactions) where a high degree of background non-stationary noise in the form of other people's speech, street noise etc. can be expected.

Wiener filtering is a more general approach, which requires estimates of both the clean speech power spectrum and the noise power spectrum. In Wiener filtering approaches, the i-th frame of the noisy input signal is filtered by the impulse response of the filter $H_i(\omega)$:

$$H_i(\omega) = \frac{P_i(\omega)}{P_i(\omega) + P_n(\omega)}$$

where $P_n(\omega)$ is the estimate of the noise spectrum and $P_i(\omega)$ is the estimate of the clean speech spectrum for the i-th frame.
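The filter above can be evaluated bin by bin once both spectral estimates are available. The sketch below is only an illustration: the function name, the list-of-bins representation and the handling of bins where both estimates are zero are assumptions.

```python
from typing import List


def wiener_gain(signal_psd: List[float], noise_psd: List[float]) -> List[float]:
    """Per-bin Wiener gain: signal estimate / (signal estimate + noise estimate)."""
    gains = []
    for p_s, p_n in zip(signal_psd, noise_psd):
        total = p_s + p_n
        # Assumed convention: a bin with no signal and no noise gets gain 0.
        gains.append(p_s / total if total > 0.0 else 0.0)
    return gains


# Bins dominated by speech keep a gain near 1; noisy bins are attenuated.
print(wiener_gain([3.0, 0.0, 1.0], [1.0, 2.0, 1.0]))  # [0.75, 0.0, 0.5]
```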

Unlike the typical filtering theory of designing a filter for a desired frequency response, the Wiener filter approaches filtering from a different angle. For Wiener filtering, additional information regarding the spectral content of the original signal and the noise is required. Usually it is assumed that signal and (additive) noise are stochastic processes with known spectral characteristics or known autocorrelation and cross-correlation. The performance criterion is usually the minimum mean-square error. An optimal filter can be found from a solution based on scalar methods. The goal of the Wiener filter is to filter out noise that has corrupted a signal by statistical means.

There are a wide variety of Wiener filtering techniques depending on how the clean speech power spectrum estimate is obtained for any given frame. These can be broadly categorized as based on spectral subtraction, on estimates of the signal spectrum from previous "cleaned" frames, on model-based estimates such as linear prediction, or on vector quantizer codebooks. These methods are typically employed in an iterative framework.

The hypothesized Wiener filtering (HWF) was originally proposed in A. D. Berstein, I. D. Shallom, "An hypothesized wiener filtering approach to noisy speech recognition", Proc. ICASSP '91, pages 913-916, 1991, for robust speaker-dependent isolated word recognition in a DTW framework and has subsequently been adapted to HMM frameworks using state-based filtering for noisy speech recognition. However it has not been used for speaker-recognition applications so far.

The proposed method can be used for any variable-text text-dependent speaker-recognition application with additive background noise which can be white noise or non-stationary color noise and which uses any of the three types of systems: closed-set speaker-identification, speaker-verification and open-set speaker-identification.

FIG 1 shows the typical architecture of the variable-text speaker-recognition system based on the one-pass dynamic programming (DP) matching algorithm proposed in V. Ramasubramanian et al., "Text-dependent speaker recognition systems based on one-pass dynamic programming algorithm", Proc. IEEE Odyssey 2006: The Speaker and Language Recognition Workshop, Puerto Rico, June 2006. The following describes how the proposed HWF algorithm is incorporated within the one-pass DP matching algorithm for noise-robust variable-text text-dependent speaker-recognition. The schematic of the speaker-recognition system based on the one-pass DP algorithm is shown in FIG 1. FIG 1 shows the matching for one speaker. Each speaker has a set of templates 9 for each word in the vocabulary. For example, for the word "nine", there are four templates 9: $R_{91}$, $R_{92}$, $R_{93}$, $R_{94}$. Given an input utterance 1, the feature extraction module 2 converts the noisy input speech utterance 1 into a sequence 3 of feature vectors. According to the shown example the mel-frequency-cepstral coefficients (MFCCs) are used as the feature vector. This feature vector sequence 3 corresponds to the input "password" text 6 (e.g. the digit string 915). For example, when the password text 6 "915" is spoken by the user, a corresponding sequence 3 of feature vectors is created and presented to the forced alignment module 4. The module 7 sets up one-pass DP recursions to create a concatenated set 8 of multiple reference templates. The corresponding concatenated set 8 of multiple reference templates 9 for "9", "1" and "5" along with the inter-word silence templates is also presented to the forced alignment module 4.

The one-pass DP algorithm with HWF in the forced alignment module 4 matches the input feature-vector sequence 3 against the word-models of "9", "1" and "5", using multiple templates 9 per word and inter-word silence templates. The resulting match score 5 ($D_k^*$) is the optimal distance between the noisy input utterance 1 and the word-templates 9 of speaker $S_k$. For closed-set speaker-identification, this score 5 is computed for each speaker of the registered speakers in the system and the speaker with the lowest score 5 is declared the identified speaker. For speaker-verification, this score 5 corresponds to the match between the input utterance 1 and the claimed speaker $S_k$'s templates 9. This score 5 is normalized by the background score, computed between the input utterance 1 and background speaker's word-templates 9. Hence a form of likelihood-ratio normalization on the one-pass DP score 5 is performed, by dividing $D_k^*$ by the impostor score computed between the input utterance and a background speaker closest to the input utterance from among the remaining speaker set. The normalized score 5 is compared to a threshold. The input speaker claim is accepted if the normalized score 5 is less than the threshold and rejected otherwise. This is done for both target speakers and impostor speakers and the probabilities of false rejection and false acceptance for the given threshold are determined. This further yields the ROC (Receiver Operating Characteristic) curve or DET (Detection Error Trade-off) curve for varying thresholds.
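The identification and verification decisions described above can be sketched as follows, assuming the per-speaker match scores have already been computed. The speaker names, score values, threshold and function names are hypothetical, and scores are assumed to be strictly positive distances.

```python
from typing import Dict


def identify_speaker(scores: Dict[str, float]) -> str:
    """Closed-set identification: the speaker with the lowest match score."""
    return min(scores, key=scores.get)


def verify_claim(scores: Dict[str, float], claimed: str, threshold: float) -> bool:
    """Accept the claim if the normalized score is below the threshold.

    The background score is taken from the closest (lowest-scoring) speaker
    among the remaining speakers, as described in the text.
    """
    background = min(s for name, s in scores.items() if name != claimed)
    normalized = scores[claimed] / background
    return normalized < threshold


# Hypothetical one-pass DP scores for three registered speakers.
scores = {"alice": 2.0, "bob": 4.0, "carol": 8.0}
print(identify_speaker(scores))            # alice
print(verify_claim(scores, "alice", 0.8))  # True  (2.0 / 4.0 = 0.5)
print(verify_claim(scores, "bob", 0.8))    # False (4.0 / 2.0 = 2.0)
```

Sweeping the threshold over a set of target and impostor trials would give the false-rejection and false-acceptance rates from which a DET curve is drawn.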

The proposed HWF algorithm is preferably set within this one-pass DP framework. This enables use of multiple templates 9 for each word in the password so as to capture the intra-speaker variability adequately and also allow for handling continuous input speech with arbitrary inter-word pauses.

FIG 2 shows an example of matching of two sequences of feature vectors by one-pass DP. It shows a matching between the input utterance 13, 14, 15 (corresponding to the password text 6 "915") against the reference templates 23, 24, 25 "$R_9 R_1 R_5$" for a speaker. The input utterance 13, 14, 15 is converted to a sequence of feature vectors 16, 17 $X = [X_1, X_2, \dots, X_i, \dots, X_{T_x}]$, each i-th feature vector $X_i$ coming from the i-th frame (segment of speech) of input speech. The reference template is a sequence of feature vectors 18 $Y = [Y_1, Y_2, \dots, Y_j, \dots, Y_{T_y}]$. This is a simplified illustration of the one-pass DP matching used in text-dependent speaker-recognition as described in V. Ramasubramanian et al., "Text-dependent speaker recognition systems based on one-pass dynamic programming algorithm", Proc. IEEE Odyssey 2006: The Speaker and Language Recognition Workshop, Puerto Rico, June 2006.

As shown in FIG 2, the one-pass DP process optimally warps the template feature vector sequence Y to align it with the test feature vector sequence X. In this process, the warping function 26 $j = f(i)$ is generated which relates the i-th frame of X to the j-th frame of Y such that the accumulated distance 27 (score) between them over the warping path is minimized. The resulting match score is the optimal distance between the input utterance and the word-templates. For speaker-identification, one such score is computed for each speaker and the speaker with the lowest score is declared the input speaker. For speaker-verification, this score corresponds to the match between the input utterance and the "claimed speaker's" models. This score is normalized by the background score, computed between the input utterance and background speaker's word-templates and the normalized score is compared to a threshold. The input speaker claim is accepted if the normalized score is less than the threshold and rejected otherwise.
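To make the warping function concrete, the sketch below recovers a warping path by backtracking through a precomputed grid of local distances. It covers only a single template without silence models, so it is a simplification of the one-pass DP matching; the step pattern and the function name are assumptions for this example.

```python
from typing import List, Optional, Tuple


def warping_path(grid: List[List[float]]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (accumulated distance, path) for a grid of local distances.

    grid[i][j] is the local distance between test frame i and template frame j.
    """
    tx, ty = len(grid), len(grid[0])
    acc = [[float("inf")] * ty for _ in range(tx)]
    back: List[List[Optional[Tuple[int, int]]]] = [[None] * ty for _ in range(tx)]
    for i in range(tx):
        for j in range(ty):
            if i == 0 and j == 0:
                acc[0][0] = grid[0][0]
                continue
            # Assumed step pattern: predecessors (i-1, j), (i, j-1), (i-1, j-1).
            candidates = []
            if i > 0:
                candidates.append((acc[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((acc[i][j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                candidates.append((acc[i - 1][j - 1], (i - 1, j - 1)))
            best_cost, best_prev = min(candidates, key=lambda c: c[0])
            acc[i][j] = grid[i][j] + best_cost
            back[i][j] = best_prev
    # Backtrack from the final pair of frames to the first.
    path = []
    node: Optional[Tuple[int, int]] = (tx - 1, ty - 1)
    while node is not None:
        path.append(node)
        node = back[node[0]][node[1]]
    path.reverse()
    return acc[tx - 1][ty - 1], path


cost, path = warping_path([[0.0, 5.0, 5.0], [5.0, 0.0, 5.0], [5.0, 5.0, 0.0]])
print(cost, path)  # 0.0 [(0, 0), (1, 1), (2, 2)]
```

Each pair (i, j) on the returned path relates test frame i to template frame j.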

FIG 2 further shows how the input utterance on the x-axis 11 (corresponding to the password text "915") is matched against the reference templates "$R_9 R_1 R_5$" on the y-axis 12 for a speaker. This is a simplified illustration of the one-pass DP matching which in actuality uses multiple templates of each word in the password text so that any of the multiple templates of each word is selected for the optimal match and also uses inter-word silence templates so as to allow for inter-word pauses to be present or absent in the input utterance.

FIG 3 shows a schematic overview of an embodiment of the invention. Means 30 for storing at least one speaker specific reference template 31 are provided. A speaker specific reference template 31 comprises a sequence of frames 32 of time domain speech samples from a clean speech utterance of a password text of the respective speaker 33. The clean speech utterance was, for example, recorded with means 54 for recording during a training period prior to the use of the speaker recognition system. A sequence 36 of frames 37 of time domain speech samples of an input utterance of noisy speech of the password text, by the same or a different speaker 33, is recorded with means 34, 35 for recording. Means 38 for filtering are used to filter the frames 37 of time domain speech samples of the input utterance of noisy speech with a hypothesized Wiener filter. For the hypothesized Wiener filter, the power spectral density 39 of a frame 32 of the speaker specific reference template 31 is used as a signal estimate and the power spectral density 40 of a frame 41 of time domain noise sample of a non-speech region of the input utterance of noisy speech is used as a noise estimate. The power spectral densities 39, 40 are computed with the respective modules 51, 53. The frame 41 of time domain noise sample of a non-speech region of the input utterance of noisy speech is determined with the module 52. The output 42 of the hypothesized Wiener filter consists of cleaned frames 43 of time domain speech samples of the input utterance of noisy speech. Feature extraction means 44 are used for converting a frame 32, 43 of time domain speech samples into a corresponding feature vector 47, 46. Means 45 are provided for determining the respective local distance between a feature vector 46 of a cleaned frame 43 and the feature vector 47 of the respective frame 32 of the speaker specific reference template 31 whose power spectral density 39 was used as a signal estimate for the filtering to get the cleaned frame 43, and for assigning the respective local distance to the respective pair of frame 37 of time domain speech samples of the input utterance of noisy speech and frame 32 of the speaker specific reference template 31. A warping path 49 for an optimized time alignment between a sequence of feature vectors of the input utterance of noisy speech and a sequence of feature vectors of the clean speech utterance of the respective speaker, minimizing the accumulation 50 of the respective assigned local distances over the warping path 49, is determined with a dynamic time warping module 48.
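As an illustration of what the psd modules 51, 53 might compute, the following sketch estimates the power spectral density of a frame with a plain periodogram. This is an assumption, not the method stated in the text: the source does not specify the psd estimator, and the function names and the 1/N scaling are hypothetical choices.

```python
import cmath


def frame_psd(frame):
    """Periodogram estimate: |DFT(frame)|^2 / N for every frequency bin.

    A direct O(N^2) DFT is used to keep the sketch dependency-free.
    """
    n_samples = len(frame)
    psd = []
    for k in range(n_samples):
        acc = 0j
        for n, sample in enumerate(frame):
            acc += sample * cmath.exp(-2j * cmath.pi * k * n / n_samples)
        psd.append(abs(acc) ** 2 / n_samples)
    return psd


def average_psd(frames):
    """Average the periodograms of several equal-length frames, e.g. the
    frames of a non-speech region used for the noise estimate."""
    psds = [frame_psd(f) for f in frames]
    return [sum(bins) / len(psds) for bins in zip(*psds)]
```

The same routine could serve for both the clean template frames (signal estimate) and the non-speech frames (noise estimate); averaging over several non-speech frames is one possible way to smooth the noise estimate.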

An embodiment of the proposed HWF based system and method for speaker recognition is described as follows. The input speech (e.g. FIG 2 x-axis 11) is represented by $T_x$ frames $x_1, x_2, \ldots, x_i, \ldots, x_{T_x}$, where $x_i$ is the sequence of speech samples of the i-th frame, and the corresponding sequence of MFCC feature vectors is $X_1, X_2, \ldots, X_i, \ldots, X_{T_x}$. The sequence of MFCC feature vectors for the concatenated reference templates on the y-axis is $Y_1, Y_2, \ldots, Y_j, \ldots, Y_{T_y}$. Let the power spectral density (psd) of the input speech be $P_{x_1}(w), P_{x_2}(w), \ldots, P_{x_i}(w), \ldots, P_{x_{T_x}}(w)$ and the psd of the concatenated reference templates be $P_{y_1}(w), P_{y_2}(w), \ldots, P_{y_j}(w), \ldots, P_{y_{T_y}}(w)$. Let $P_n(w)$ be the noise-estimate obtained from the most recent non-speech region of the input noisy speech.

The DTW matching is an optimal time-alignment between the input utterance and the reference templates, wherein the warping function $j = f(i)$ relates the i-th frame of the input utterance to the j-th frame of the reference templates such that the accumulated distance between frame i of the input utterance and frame j of the reference templates over the warping path is minimized. The point (i,j) 19 is associated with the minimum accumulated distortion $D_A(i,j)$, which is given by the recursion (as shown in FIG 2):

$$D_A(i,j) = \min_{k \in \{j,\, j-1,\, j-2\}} \left[ D_A(i-1,k) + d_w(i,j) \right]$$
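The recursion can be sketched in a few lines, under stated assumptions. The local distances are taken as a precomputed grid, indices are 0-based, and the path is assumed to start at the first frame pair and end at the last frame pair. The source does not spell out these boundary conditions, and the function name is hypothetical.

```python
import math


def dtw_accumulated_distance(local_dist):
    """Accumulated distortion D_A at the last grid point.

    local_dist[i][j] is the local distance between input frame i and
    template frame j. Each step advances the input by one frame and the
    template by 0, 1 or 2 frames (predecessors k in {j, j-1, j-2}).
    """
    t_x = len(local_dist)
    t_y = len(local_dist[0])
    acc = [[math.inf] * t_y for _ in range(t_x)]
    acc[0][0] = local_dist[0][0]  # assumed start point of the path
    for i in range(1, t_x):
        for j in range(t_y):
            best_prev = min(acc[i - 1][k] for k in (j, j - 1, j - 2) if k >= 0)
            acc[i][j] = best_prev + local_dist[i][j]
    return acc[t_x - 1][t_y - 1]  # assumed end point of the path
```

Grid points that cannot be reached from the start keep an infinite accumulated distance, so they never lie on the optimal path.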

where $d_w(i,j) = d(\hat{X}_{ij}, Y_j)$ is the Euclidean distance between the MFCC vectors $\hat{X}_{ij}$ and $Y_j$. While $Y_j$ is the MFCC vector of the j-th frame of the concatenated reference template, $\hat{X}_{ij}$ is the MFCC vector of the speech signal of the i-th frame given by $\hat{x}_{ij}$, obtained by Wiener filtering the input noisy frame $x_i$ using the Wiener filter frequency response given by

$$W_j(w) = \frac{P_{y_j}(w)}{P_{y_j}(w) + P_n(w)}$$

This is done by computing the psd $P_{\hat{x}_{ij}}(w) = P_{x_i}(w) \cdot W_j(w)$ and obtaining $\hat{x}_{ij}$ as a frame of time domain samples corresponding to the psd $P_{\hat{x}_{ij}}(w)$. The MFCC vector $\hat{X}_{ij}$ is then obtained from $\hat{x}_{ij}$. By this, the DTW / one-pass DP algorithm is computed on a grid of local distances $d_w(i,j)$, $i = 1, \ldots, T_x$, $j = 1, \ldots, T_y$, where each column i contains the local distances between the "clean" frames $y_j$, $j = 1, \ldots, T_y$, and the corresponding "cleaned" frames $\hat{x}_{ij}$, $j = 1, \ldots, T_y$. Clearly, since the reference template is the "clean" version of the input noisy speech, a column i will have the lowest $d_w(i,j)$ for that j which corresponds to the noisy frame i within a non-linear warping factor. Thus the DTW will now find the optimal warping path $j = f^*(i)$ which minimizes the accumulated distortion $D^* = D_A(T_x, T_y)$ given by

$$D^* = \min_{f} \sum_{i=1}^{T_x} d_w(i, f(i))$$
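The construction of the grid of local distances can be sketched as follows. This is a simplified illustration rather than the procedure of the embodiment: instead of resynthesizing a time domain frame and extracting MFCC vectors, the sketch uses the log of the psd as a stand-in feature, which is an assumption made only to keep the example short. All function names are hypothetical.

```python
import math


def wiener_response(psd_template, psd_noise):
    """W_j(w) = P_yj(w) / (P_yj(w) + P_n(w)), evaluated per frequency bin."""
    return [py / (py + pn) if (py + pn) > 0 else 0.0
            for py, pn in zip(psd_template, psd_noise)]


def hwf_local_distance(psd_noisy, psd_template, psd_noise, floor=1e-12):
    """Local distance between a noisy frame, cleaned with the Wiener filter
    hypothesized from one template frame, and that template frame.

    Assumption: log-psd vectors replace the MFCC vectors of the source.
    """
    response = wiener_response(psd_template, psd_noise)
    cleaned = [px * w for px, w in zip(psd_noisy, response)]
    feat_cleaned = [math.log(max(p, floor)) for p in cleaned]
    feat_template = [math.log(max(p, floor)) for p in psd_template]
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(feat_cleaned, feat_template)))


def hwf_grid(noisy_psds, template_psds, psd_noise):
    """Grid d_w(i, j): one row per noisy input frame i, one entry per
    template frame j, each entry using its own hypothesized filter."""
    return [[hwf_local_distance(px, py, psd_noise) for py in template_psds]
            for px in noisy_psds]
```

Note that every grid entry requires its own filtering step, since the filter depends on the template frame j. When the noisy psd is exactly the sum of the template psd and the noise estimate, the cleaned psd equals the template psd and the local distance is zero.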

This $D^*$ corresponds to the HWF score $D_k^*$ for speaker k, i.e., when the y-axis 12 of FIG 2 uses the speaker k "clean" templates. Speaker-identification and speaker-verification are done using the score $D_k^*$ as described earlier (see FIG 1).

$D_k^*$ will be the lowest for the correct speaker, even though for each speaker the y-axis 12 in FIG 2 uses the same "password" (i.e., $R_9 R_1 R_5$ for the example shown), but with speaker k's word templates. Here, the HWF exploits the fact that the correct speaker's reference templates provide a better "cleaning" up of the input noisy speech (x-axis) and hence result in lower local distances along the optimal path for that speaker than for other speakers.

This is illustrated in FIG 4, FIG 5 and FIG 6, where the matrix 60, 63, 66 of local distances $d_w(i,j)$, $i = 1, \ldots, T_x$, $j = 1, \ldots, T_y$, is plotted for a single digit (digit 8) test utterance on x-axis 61, 64, 67 for the three cases:

(i) FIG 4: No HWF is performed, but the y-axis 62 has the clean reference template of the same speaker as the input noisy speech (NR),

(ii) FIG 5: HWF performed with y-axis 65 having the clean reference template of the same speaker as the input noisy speech (NR) ,

(iii) FIG 6: HWF performed with y-axis 68 having the clean reference template of a speaker (RR) who is not the speaker of the input noisy speech (NR) .

Clearly, it can be noted that case (i) performs poorly though the same speaker template is used. The conventional DTW (without HWF) is unable to find any good optimal path for matching, and this results in a loss of recognition accuracy as the input speaker becomes confusable with other speakers. Case (ii) shows that HWF is able to find a good optimal path when using local distances derived after the HWF operation on the noisy speech with the correct speaker's clean template. Case (iii) again performs poorly, with no good optimal path, as there is a mismatch between the speaker of the input speech and the reference templates. HWF actually adds to the discriminability by ensuring that the local distances resulting from case (iii) are higher than in case (ii), due to the fact that, in case (iii), the noisy input frames are filtered by "clean" spectra of some other speaker (though the textual content is the same). FIG 4, FIG 5 and FIG 6 also show the respective matching score $D^*$ for each of these cases, clearly validating the above differences.

It should be noted that this is a far more demanding requirement than the isolated word recognition (IWR) task for which HWF was originally proposed (and has been used so far). In the case of using HWF for IWR, the "correct word" naturally provides a better "cleaning" up and hence a lower DTW score than an "incorrect word" whose spectral content obviously does not match the input speech. In contrast, the HWF's task in speaker-recognition is all the more difficult since, for a given speaker, the DTW-HWF algorithm needs to provide a better match only when there is a "speaker match", despite having the same "word-content" between the x-axis and y-axis for all speakers.

It should also be noted that the cleaned frames $\hat{x}_{ij}$, $i = 1, \ldots, T_x$, with $j = f^*(i)$ as determined by the optimal DTW warping, correspond to the cleaned speech using the proposed HWF algorithm. Thus, the proposed method also provides an "enhancement" of the input noisy speech as a by-product of speaker-recognition, particularly for the correct speaker identified by the system.

The features extracted from $\hat{x}_{ij}$, $i = 1, \ldots, T_x$, as given above correspond to robust features from the enhanced speech. Thus the proposed method also provides for extraction of robust features as a by-product of the speaker-recognition. For instance, the MFCC vectors $\hat{X}_{ij}$ corresponding to $\hat{x}_{ij}$ as defined above correspond to a robust set of MFCC features from the input noisy speech, particularly for the correct speaker identified by the system.

In the following, results of the HWF algorithm used within the one-pass DP framework for text-dependent speaker recognition are presented. The HWF algorithm is evaluated for both speaker-identification (closed-set) and speaker-verification on eight speakers in the TIDIGITS database, which has an eleven word vocabulary {"oh", 0-9}. The clean templates were extracted from the 7-digit strings and the test data consists of 3-, 4-, and 5-digit strings with eleven utterances each per speaker. These experiments adequately bring out the basic performance potential of HWF. It is also compared with conventional spectral-subtraction of the input noisy speech before feature extraction in FIG 1. These algorithms (along with the baseline performance of "noisy speech" without any noise-removal) are evaluated for clean test data, for additive white noise of SNRs 0 dB, 5 dB and 10 dB (SNR = Signal to Noise Ratio), and for non-stationary noises (factory, chopper and babble) of 0 dB SNR (from the NOISEX92 database).

Comparisons with spectral subtraction bring out an important difference between spectral subtraction and Wiener filtering. Spectral subtraction depends solely on obtaining a noise-estimate (from the most recent non-speech region), subtracting it from successive speech spectra, and then generating the enhanced speech by the overlap-add synthesis method. The performance of spectral subtraction therefore depends only on how well the noise-estimate matches the noise spectra of the noisy speech. In fact, this works quite well when the noise is stationary colored noise, such as car-noise, which allows the spectral subtraction to effectively subtract out the colored noise spectra from the noisy speech regions using the noise-estimate, which correctly has the form of the spectral envelope of the stationary colored noise.

However, when the input noise is white noise (or non-stationary colored noise), the spectral subtraction technique fails completely, since the noise-estimate obtained from one non-speech region no longer matches the white noise spectra in a subsequent noisy speech region. The white noise estimate exhibits random spectral variations about a flat spectral envelope and therefore does not subtract out the similarly flat, but equally random, white noise spectra in a noisy speech region. Thus, subtraction actually leaves behind remnant white noise spectra and at best (when the noise-estimate becomes more and more flat due to longer time-averages in the non-speech region) results in the original noisy speech spectra having a reduced spectral average, equivalent to an overall attenuation of the noisy speech without any enhancement; the resultant speech therefore provides no improved recognition accuracy.

Spectral subtraction behaves in a similar way for non-stationary colored noise also, where the colored noise spectra are time varying and the noise-estimate used by the spectral subtraction (from the most recent non-speech region) no longer matches (and is hence unable to subtract out) the time-varying colored noise spectra in the noisy speech in subsequent speech regions. In contrast, since the Wiener filter uses the clean speech estimate in addition to the noise-estimate, Wiener filtering is able to provide an improved enhancement by virtue of having a good approximation of the underlying speech spectrum during the noisy speech period, even in such conditions when the noise-estimate is inadequate to correctly represent the current noise spectra in the speech regions.
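The structural difference between the two operations can be made concrete with a small per-bin sketch. It is only an illustration with hypothetical function names, and flooring the subtracted spectrum at zero is an assumed (though common) convention that the source does not state.

```python
def spectral_subtraction(psd_noisy, psd_noise_est):
    """Subtract the noise estimate from the noisy psd, bin by bin.

    The result depends only on how well the noise estimate matches the
    actual noise in this frame. Negative values are floored at zero
    (assumed convention).
    """
    return [max(px - pn, 0.0) for px, pn in zip(psd_noisy, psd_noise_est)]


def hypothesized_wiener(psd_noisy, psd_template, psd_noise_est):
    """Weight the noisy psd by P_y / (P_y + P_n), bin by bin.

    The clean template psd acts as the signal estimate, so bins where the
    template has little energy are attenuated even if the noise estimate
    is inaccurate.
    """
    cleaned = []
    for px, py, pn in zip(psd_noisy, psd_template, psd_noise_est):
        denom = py + pn
        cleaned.append(px * py / denom if denom > 0 else 0.0)
    return cleaned
```

The second function takes one more argument than the first: the clean speech estimate. That additional input is what the comparison in the text attributes the improved enhancement to.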

These differences are brought out in the following experimental results for white noise. FIG 7 and FIG 8 show the closed-set speaker-identification accuracy (in percent on y-axis 71, 81) using the one-pass DP algorithm with one and five templates per word respectively, for test data SNRs of 0 dB, 5 dB and 10 dB (x-axis 70, 80). The respective clean speech value is shown as point 72, 82. It can be seen that, while the noisy speech (Noisy, curve 75, 85) has a very poor performance, spectral subtraction (SS, curve 74, 84) provides only a marginal improvement. However, HWF (curve 73, 83) has a significantly higher performance, clearly validating the effectiveness of the proposed method for speaker-identification for all the SNRs considered here. The performance improvement (about 10%) from using one to five templates in the one-pass DP algorithm can also be noted.

FIG 9 shows the speaker-verification performance using the DET (Detection error trade-off) curve. On the x-axis 91 the probability of false acceptance in percent is given, whereas the y-axis 92 shows the probability of false rejection in percent. The curves 95, 94, 93 show the results for noisy speech, SS and HWF respectively. The one-pass DP algorithm is used with five templates for a test data SNR of 0 dB. It can be noted that, as in speaker-identification, the performance of noisy speech is very poor and spectral subtraction does not improve this. On the contrary, HWF offers an excellent improvement with a much lower EER (Equal-error-rate), i.e., the operating point where the probability of false acceptance ($P_{fa}$) equals the probability of false rejection ($P_{fr}$). Table 1 shows the EER points ($P_{fa}$, $P_{fr}$) for various SNRs (0 dB, 5 dB and 10 dB) for all three cases: noisy speech (NOISY), spectral subtraction (SS) and the proposed HWF algorithm (HWF), using five templates in the one-pass DP algorithm (as in FIG 9). Here again, it can be noted that HWF offers the best performance improvement, while SS performs as poorly as the NOISY case itself.

TEST SNR    NOISY             SS                HWF
            Pfa     Pfr       Pfa     Pfr       Pfa     Pfr
0 dB        48.11   47.35     43.56   43.18     7.95    7.20
5 dB        46.59   44.70     36.74   34.85     3.79    4.17
10 dB       42.05   42.80     28.79   31.82     2.27    1.89

Table 1: Speaker-verification EER points (Pfa, Pfr) for test SNRs (0 dB, 5 dB, 10 dB); clean EER = (0, 0)
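How an EER point of the kind listed in Table 1 can be located from verification scores is sketched below. The sketch rests on assumptions: the scores are distances (lower means a better match), a claim is accepted when its score is below the threshold, and the EER point is taken as the threshold at which the two error rates are closest. The function names and the score lists used in the example are hypothetical and are not data from the experiments.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (P_fa, P_fr) as fractions for one decision threshold.

    A claim is accepted when its score is below the threshold, so an
    impostor score below the threshold is a false acceptance and a
    genuine score at or above it is a false rejection.
    """
    p_fa = sum(s < threshold for s in impostor_scores) / len(impostor_scores)
    p_fr = sum(s >= threshold for s in genuine_scores) / len(genuine_scores)
    return p_fa, p_fr


def eer_point(genuine_scores, impostor_scores):
    """Sweep candidate thresholds and return the (P_fa, P_fr) pair whose
    two error rates are closest to each other."""
    candidates = sorted(set(genuine_scores) | set(impostor_scores))
    candidates.append(candidates[-1] + 1.0)  # threshold above every score
    rates = [error_rates(genuine_scores, impostor_scores, t)
             for t in candidates]
    return min(rates, key=lambda r: abs(r[0] - r[1]))
```

With perfectly separated scores, for instance genuine scores `[1.0, 2.0]` and impostor scores `[5.0, 6.0]`, the sweep finds a threshold with no errors of either kind, which corresponds to the clean-speech entry (0, 0) in the table caption.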

In order to show the effectiveness of HWF on non-stationary noise as discussed earlier, the closed-set speaker-identification performance of the algorithm is evaluated on three types of noises, namely, factory noise, chopper noise, and babble noise, for a test data SNR of 0 dB for the same set of speakers in TIDIGITS as above (and with five templates in the one-pass DP algorithm). Table 2 shows the speaker-identification accuracy for the noisy speech (NOISY), spectral subtraction (SS) and the proposed HWF algorithm. It can be observed that the performance of noisy speech is poor, and spectral subtraction achieves only modest relative improvements over the noisy case. However, HWF has an excellent performance, offering a large improvement over the noisy and SS cases.

NOISE-TYPE    NOISY    SS       HWF
Factory       17.04    28.41    95.45
Chopper       29.54    48.86    88.63
Babble        52.27    79.54    96.50

Table 2: Speaker-identification accuracy (%) for three non-stationary noises for test SNR of 0 dB; clean accuracy = 100%

FIG 10 illustrates the use of multiple templates in the proposed one-pass DP forced alignment between the input utterance 106 (on the x-axis 100) and the word-templates 102, 103, 104, 105 (on the y-axis 101). The same example password of "915" as in FIG 1 is used. Even though multiple templates are being used for all the words, here, for the sake of clarity, only the multiple templates 103, 104 of the word "1" are shown on the y-axis. From the best warping path 107 obtained by the one-pass DP algorithm in this example, it is seen that the template 104 ('2') of the word "1" ($R_{1,2}$) has been chosen as the best matching template for that part (word "1") of the input utterance 106.

FIG 11 illustrates a typical matching by the proposed one-pass DP algorithm with templates 120, 121, 122, 123 for inter-word silences. In this example, it is assumed that the input utterance 113, 114, 115, 116 (on the x-axis 111) is the same ("915") as in FIG 1, but it is spoken with silence 114 before "9", silence 115 between "1" and "5" and silence 116 after "5". There is no inter-word silence between "9" and "1", representing an inter-word co-articulation. The one-pass DP algorithm uses (on the y-axis 112) concatenated "multiple" templates 117, 118, 119 of each word in the password "915" as in FIG 10, but with a silence template 120, 121, 122, 123 between adjacent words (for the sake of clarity and also to emphasize the handling of inter-word silence, only one template per word is shown in FIG 11). The one-pass DP recursions now allow for entry into any word either from a silence template or from one of the multiple templates of the predecessor words. FIG 11 shows how the one-pass DP algorithm now correctly decodes the input utterance 113, 114, 115, 116, skipping the silence template 121 between words "9" and "1", resulting in the warping path 110. The inter-word silences 114, 115, 116 are mapped to the corresponding silence templates 120, 122, 123.

In the following the dynamic programming recursions of the proposed one-pass DP algorithm using HWF are given, for the combined case of multiple templates and inter-word silence, illustrating how the warping paths 107, 110 (shown in FIG 10 and FIG 11) are realized jointly. The recursions for two specific parts, one for word-templates and the other for the inter-word silence templates, are presented next.

FIG 12 shows the two main types of word template recursions: Within-word recursion 125 and across-word recursion 126 for a general case of any word template, but in the context of the password-sequence "915". The general equations for these two types of recursions are:

Within-word recursion:

$$D(m,n,v) = d_w(m,n,v) + \min_{n-2 \le j \le n} D(m-1,j,v)$$

Across-word recursion:

$$D(m,1,v) = d_w(m,1,v) + \min\Big\{ D(m-1,1,v),\ \min_{u \in \mathrm{Pred}'(v)} D(m-1,N_u,u) \Big\}$$

Here, $D(m,n,v)$ is the minimum accumulated distortion by any path reaching the grid point defined as frame "n" of word-template "v" 127 and frame "m" 128 of the input utterance. $d_w(m,n,v)$ is the local distance between the n-th frame of the word-v template and the m-th frame of the input utterance, obtained subsequent to Wiener filtering of frame m of the noisy input utterance with the Wiener filter that uses frame n of word-template v of the particular speaker's clean templates, as defined above. The within-word recursion applies to all frames of the word v template 127 which are not the starting frame (i.e., n > 1). The across-word recursion applies to frame 1 of any word v to account for a potential "entry" into the word v template from the last frame $N_u$ of any of the other words {u} which are valid predecessors of word v; i.e., $\mathrm{Pred}'(v) = \{\text{silence template } R_{sil},\ \mathrm{Pred}(v)\}$. These are the valid predecessors of any word v, consisting of a silence template $R_{sil}$ 129 and the multiple templates Pred(v) 130 of the word preceding the word v in the "password" text. For instance, if the "password" text is 915, and v=5, then $\mathrm{Pred}'(v=5) = \{R_{sil}, R_{11}, R_{12}, R_{13}, R_{14}\}$; likewise, $\mathrm{Pred}'(v=1) = \{R_{sil}, R_{91}, R_{92}, R_{93}, R_{94}\}$. This across-word recursion takes care of entry into any template of any word from a preceding silence template or from any template of the preceding word in the password text.

FIG 13 shows recursions for an inter-word silence template. This is illustrated for the transition from any of the four templates 135 of word "1" to the silence template 136 between words "1" and "5". The within-word recursion 137 and across-word recursion 138 in this case are:

Within-word recursion:

$$D(m,n,v) = d_w(m,n,v) + \min_{n-2 \le j \le n} D(m-1,j,v)$$

Across-word recursion:

$$D(m,1,v) = d_w(m,1,v) + \min\Big\{ D(m-1,1,v),\ \min_{u \in \mathrm{Pred}(v)} D(m-1,N_u,u) \Big\}$$

Here, all terms are the same as in the recursions given above, except the definition of Pred(v), where v is the inter-word silence template $R_{sil}$ 136 between two consecutive words in the password. Thus, Pred(v) is the set of the multiple templates 135 of the preceding word in the "password" text. For instance, if the "password" text is 915, then $\mathrm{Pred}(v = R_{sil} \text{ between 1 and 5}) = \{R_{11}, R_{12}, R_{13}, R_{14}\}$, i.e., the four templates of word "1".

The above recursions together describe the one-pass DP recursion for using multiple templates and inter-word silence templates for forced alignment matching as required in the variable-text speaker-recognition. The best (lowest) score among $D(T, N_r, r)$, $r = 1, \ldots, L+1$, where T is the last frame of the input utterance and $r = 1, \ldots, L+1$ refers to the L multiple templates of the last word in the password text and the last silence template (with $N_r$ as their respective last frames), yields the minimum accumulated distance $D_i$ of the match between the input utterance and the "password" text and is used as the score for that speaker i whose word-templates were used.
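The combined recursions can be sketched as follows, under several stated assumptions. The password is modeled as a list of "slots" that alternate between silence and word templates, each slot holding one or more alternative templates. Silence slots are flagged as optional, so a word can be entered from the preceding silence or from the word before it. A path is assumed to start at frame 1 of the first slot (or of the second slot if the first is optional). The local distance is passed in as a function; in the proposed method it would be the HWF-based distance $d_w(m,n,v)$, whereas the test uses a plain absolute difference. All names are hypothetical.

```python
import math


def predecessor_slots(s, optional):
    """Slots from which slot s may be entered: the slot before it and, if
    that slot is optional (a silence), also the slot before that one."""
    preds = []
    if s >= 1:
        preds.append(s - 1)
        if optional[s - 1] and s >= 2:
            preds.append(s - 2)
    return preds


def one_pass_dp(input_frames, slots, optional, dist):
    """Minimum accumulated distance of a forced alignment.

    slots[s] is a list of alternative templates for position s and each
    template is a list of frames. optional[s] marks skippable slots.
    """
    prev = None
    for m, frame in enumerate(input_frames):
        cur = [[[math.inf] * len(tpl) for tpl in slot] for slot in slots]
        for s, slot in enumerate(slots):
            for t, tpl in enumerate(slot):
                for n in range(len(tpl)):
                    if m == 0:
                        # Assumed start condition: first frame of slot 0,
                        # or of slot 1 when slot 0 is optional.
                        start = n == 0 and (s == 0 or (s == 1 and optional[0]))
                        best = 0.0 if start else math.inf
                    elif n == 0:
                        # Across-word recursion: stay in frame 1 or enter
                        # from the last frame of any valid predecessor.
                        best = prev[s][t][0]
                        for ps in predecessor_slots(s, optional):
                            for u in range(len(slots[ps])):
                                best = min(best, prev[ps][u][-1])
                    else:
                        # Within-word recursion over frames n, n-1, n-2.
                        best = min(prev[s][t][j]
                                   for j in (n, n - 1, n - 2) if j >= 0)
                    cur[s][t][n] = best + dist(frame, tpl[n])
        prev = cur
    last = len(slots) - 1
    finals = [prev[last][t][-1] for t in range(len(slots[last]))]
    if optional[last] and last >= 1:
        finals += [prev[last - 1][t][-1] for t in range(len(slots[last - 1]))]
    return min(finals)
```

Only the previous column of accumulated distances is kept, since every recursion looks back exactly one input frame. Recovering the warping path itself would additionally require backpointers, which are omitted here.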

FIG 2 showed a matching between the input utterance (corresponding to the password text "915") against the reference templates "$R_9 R_1 R_5$" for a speaker. The input utterance is converted to a sequence of feature vectors $X = [X_1, X_2, \ldots, X_i, \ldots, X_{T_x}]$, each i-th feature vector $X_i$ coming from the i-th frame (segment of speech) of input speech. The reference template is a sequence of feature vectors $Y = [Y_1, Y_2, \ldots, Y_j, \ldots, Y_{T_y}]$. This was a simplified illustration of the one-pass DP matching with multiple templates.

As shown in FIG 2, the central aspect of the above one-pass DP algorithm with HWF is the optimal warping of the template feature vector sequence Y to align it with the noisy test feature vector sequence X while performing hypothesized Wiener filtering implicitly on every noisy frame of the input utterance with the clean frames of the reference templates of a speaker. In this process, the warping function $j = f(i)$ is generated, which relates the i-th frame of X to the j-th frame of Y such that the accumulated distance (score) between them over the warping path is minimized. The resulting match score is the optimal distance between the noisy input utterance and the clean word-templates of a speaker, with the underlying HWF providing means for achieving noise robustness in finding the correct speaker.

The HWF operations described above for this example assume a conventional dynamic time warping recursion. In the following, the exact HWF operations are described as incorporated within the one-pass DP recursions given above.

The noisy input speech (FIG 2 x-axis 11) is represented by $T_x$ frames $x_1, x_2, \ldots, x_m, \ldots, x_{T_x}$, where $x_m$ is the sequence of speech samples of the m-th frame, and the corresponding sequence of MFCC feature vectors is $X_1, X_2, \ldots, X_m, \ldots, X_{T_x}$. Let the power spectral density (psd) of the noisy input speech be $P_{x_1}(w), P_{x_2}(w), \ldots, P_{x_m}(w), \ldots, P_{x_{T_x}}(w)$. Let $P_N(w)$ be the noise-estimate obtained from the most recent non-speech region of the input noisy speech.

The one-pass DP matching is an optimal time-alignment between the input utterance and the reference templates, wherein the warping function $j = (n,v) = f(m)$ relates the m-th frame of the input utterance to the j-th frame of the clean reference templates (where the j-th frame is frame n of word template v of the clean reference template of a speaker), such that the accumulated distance between frame m of the input utterance and frame j of the reference templates over the warping path is minimized as shown above. The point $(m, j=(n,v))$ is associated with the minimum accumulated distortion $D(m,n,v)$, which is given by the recursions shown above. In these recursions, the local distance $d_w(m,n,v)$ is given by

$$d_w(m,n,v) = d(\hat{X}_{mj}, Y_j)$$

where $d(\hat{X}_{mj}, Y_j)$ is the Euclidean distance between the MFCC vectors $\hat{X}_{mj}$ and $Y_j$.

While $Y_j$ is the MFCC vector of the j-th frame (i.e., the n-th frame of the word-template v) of the concatenated clean reference templates, $\hat{X}_{mj}$ is the MFCC vector of the "cleaned" speech signal of the m-th frame given by $\hat{x}_{mj}$, obtained by Wiener filtering the input noisy frame $x_m$ using the Wiener filter frequency response given by

$$W_j(w) = \frac{P_{y_j}(w)}{P_{y_j}(w) + P_N(w)}$$

This is done by computing the psd $P_{\hat{x}_{mj}}(w) = P_{x_m}(w)\, W_j(w)$ and obtaining $\hat{x}_{mj}$ as a frame of time domain samples corresponding to the psd $P_{\hat{x}_{mj}}(w)$. The MFCC vector $\hat{X}_{mj}$ is then obtained from $\hat{x}_{mj}$. By this, the one-pass DP algorithm is computed on a grid of local distances $d_w(m,n,v)$, $m = 1, \ldots, T_x$, $n = 1, \ldots, N_v$, $v = 1, \ldots, L$, for L templates per word and a template v with $N_v$ frames, for all words in the password text. Here, each column m contains the local distances between the "clean" frames $y_j$, $j = 1, \ldots, T_y$ (i.e., for $n = 1, \ldots, N_v$, $v = 1, \ldots, L$ for all words in the password text) and the corresponding "cleaned" frames $\hat{x}_{mj}$, $j = 1, \ldots, T_y$. Clearly, since the reference template is the "clean" version of the input noisy speech, a column m will have the lowest $d_w(m,n,v)$ for that $j = (n,v)$ which corresponds to the noisy frame m within a non-linear warping factor within the appropriate word-template v. Thus the one-pass DP will now find the optimal warping path $j = f^*(m)$ which minimizes the accumulated distortion $D^*$ given by

$$D^* = \min_{f} \sum_{m=1}^{T_x} d_w(m, f(m))$$

This $D^*$ corresponds to the HWF score $D_k^*$ for speaker k, i.e., when the y-axis 12 of FIG 2 uses the speaker k "clean" templates. Speaker-identification and speaker-verification are done using the score $D_k^*$ as described earlier.

In summary, the present invention relates to a system and method for speaker recognition. In order to improve speaker recognition in noisy environments, the proposed method comprises the step of filtering frames 37 of time domain speech samples of an input utterance of noisy speech with a hypothesized Wiener filter, wherein for the hypothesized Wiener filter the power spectral density 39 of a frame 32 of the speaker specific reference template 31 is used as a signal estimate and the power spectral density 40 of a frame 41 of time domain noise sample of a non-speech region of the input utterance of noisy speech is used as a noise estimate, the output 42 of the hypothesized Wiener filter being cleaned frames 43 of time domain speech samples of the input utterance of noisy speech.