

Title:
SYSTEM, APPARATUS, AND METHOD FOR PERFORMING SPEAKER VERIFICATION USING A UNIVERSAL BACKGROUND MODEL
Document Type and Number:
WIPO Patent Application WO/2017/157423
Kind Code:
A1
Abstract:
An apparatus (102/108), machine-implemented method (300), and computer program product for determining whether an utterance, U, was spoken by a hypothesized speaker, S, are presented. The apparatus includes a module (110a/110b) that obtains data corresponding to an utterance captured by a microphone. It also obtains a set of two or more Gaussian Mixture Models (GMMs), where each GMM comprises a set of parameter pairs. The module further selects, based on the obtained data corresponding to the captured utterance U, a subset of a first set of parameter pairs of the first GMM and a subset of a second set of parameter pairs of the second GMM. The two subsets are combined to form a Universal Background Model (UBM), λUBM. The module obtains a GMM model, λS, associated with the hypothesized speaker, S. λS, λUBM, and the obtained data corresponding to U are then used to determine whether U was spoken by S.

Inventors:
GRANCHAROV VOLODYA (SE)
KARLSSON ERLENDUR (SE)
SVERRISSON SIGURDUR (SE)
POBLOTH HARALD (SE)
Application Number:
PCT/EP2016/055564
Publication Date:
September 21, 2017
Filing Date:
March 15, 2016
Assignee:
ERICSSON TELEFON AB L M (PUBL) (SE)
International Classes:
G10L17/06; G10L17/04
Foreign References:
EP2048656A12009-04-15
Other References:
DOUGLAS REYNOLDS: "Universal Background Models", 1 February 2008 (2008-02-01), XP055282684, Retrieved from the Internet [retrieved on 20160622]
DOUGLAS A. REYNOLDS ET AL: "Speaker Verification Using Adapted Gaussian Mixture Models", DIGITAL SIGNAL PROCESSING., vol. 10, no. 1-3, 1 January 2000 (2000-01-01), US, pages 19 - 41, XP055282688, ISSN: 1051-2004, DOI: 10.1006/dspr.1999.0361
ROSENBERG A E ET AL: "Speaker background models for connected digit password speaker verification", 1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING - PROCEEDINGS. (ICASSP). ATLANTA, MAY 7 - 10, 1996; [IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING - PROCEEDINGS. (ICASSP)], NEW YORK, IEEE, US, vol. 1, 1 January 1996 (1996-01-01), pages 81 - 84, XP002079921, ISBN: 978-0-7803-3193-8, DOI: 10.1109/ICASSP.1996.540295
YI-HSIANG CHAO ET AL: "Improving GMM-UBM speaker verification using discriminative feedback adaptation", COMPUTER SPEECH AND LANGUAGE., vol. 23, no. 3, 1 July 2009 (2009-07-01), GB, pages 376 - 388, XP055282690, ISSN: 0885-2308, DOI: 10.1016/j.csl.2009.01.002
HECK L P ET AL: "HANDSET-DEPENDENT BACKGROUND MODELS FOR ROBUST TEXT-INDEPENDENT SPEAKER RECOGNITION", 1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING. SPEECH PROCESSING. MUNICH, APR. 21 - 24, 1997; [IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP)], LOS ALAMITOS, IEEE COMP. SOC. PRESS,, 21 April 1997 (1997-04-21), pages 1071 - 1074, XP000822636, ISBN: 978-0-8186-7920-9, DOI: 10.1109/ICASSP.1997.596126
Attorney, Agent or Firm:
ERICSSON (SE)
CLAIMS:

1. A machine-implemented method (300) for determining whether an utterance, U, was spoken by a hypothesized speaker, S, the method comprising:

obtaining (302) data corresponding to an utterance, U, captured by a microphone (104); obtaining (304) a set of two or more Gaussian Mixture Models (GMMs), wherein said set of GMMs comprises a first GMM, λΑ, comprising a first set of parameter pairs, each parameter pair included in the first set of parameter pairs being associated with a weight and consisting of a mean vector and a corresponding covariance matrix, and wherein said set of GMMs further comprises a second GMM, λΒ, comprising a second set of parameter pairs, each parameter pair included in the second set of parameter pairs being associated with a weight and consisting of a mean vector and a corresponding covariance matrix;

selecting (306), based on the obtained data corresponding to the captured utterance U, i) a subset of the first set of parameter pairs, thereby forming a first subset of parameter pairs, and ii) a subset of the second set of parameter pairs, thereby forming a second subset of parameter pairs; combining (308) the first subset of parameter pairs with the second subset of parameter pairs to form a Universal Background Model (UBM), λUBM, such that the UBM, λUBM, comprises the first subset of parameter pairs and the second subset of parameter pairs;

obtaining (310) a GMM model, λS, associated with the hypothesized speaker, S; and

using (312) λS, λUBM, and the obtained data corresponding to the captured utterance U to determine whether U was spoken by S.

2. The method of claim 1, wherein the first GMM is a first speaker-independent GMM trained with speech samples from a first set of speakers.

3. The method of claim 1 or 2, wherein the second GMM is a second speaker-independent GMM trained with speech samples from a second set of speakers.

4. The method of any one of claims 1-3, wherein the step of selecting the parameter pair subsets based on the data corresponding to the captured utterance comprises: using (402) the data to obtain a set of feature vectors, X, said set of feature vectors comprising n feature vectors, x1, ..., xn, wherein n > 1;

for each feature vector x1, ..., xn, assigning (404) the feature vector to one and only one of the GMMs included in said set of GMMs;

determining (406) T1, wherein T1 is the total number of feature vectors assigned to the first GMM;

determining (408) T2, wherein T2 is the total number of feature vectors assigned to the second GMM;

determining (410) a value N1 based on T1 and n;

determining (412) a value N2 based on T2 and n;

selecting (414) N1 parameter pairs from the first set of parameter pairs; and

selecting (416) N2 parameter pairs from the second set of parameter pairs, wherein λUBM comprises said N1 parameter pairs from the first set of parameter pairs and said N2 parameter pairs from the second set of parameter pairs.

5. The method of any one of claims 1-3, wherein the data corresponding to the captured utterance comprises a set of feature vectors, X, said set of feature vectors comprising n feature vectors, x1, ..., xn, wherein n > 1, and wherein the step of selecting the parameter pair subsets based on the data corresponding to the captured utterance comprises:

for each feature vector x1, ..., xn, assigning the feature vector to one and only one of the GMMs included in said set of GMMs;

determining T1, wherein T1 is the total number of feature vectors assigned to the first GMM;

determining T2, wherein T2 is the total number of feature vectors assigned to the second GMM;

determining a value N1 based on T1 and n;

determining a value N2 based on T2 and n;

selecting N1 parameter pairs from the first set of parameter pairs; and

selecting N2 parameter pairs from the second set of parameter pairs, wherein λUBM comprises said N1 parameter pairs from the first set of parameter pairs and said N2 parameter pairs from the second set of parameter pairs.

6. The method of claim 4 or 5, wherein selecting the N1 parameter pairs from the first set of parameter pairs comprises:

for each parameter pair included in the first set of parameter pairs, determining an accumulated posterior probability;

selecting from the first set of parameter pairs the N1 parameter pairs having the highest accumulated posterior probability.

7. The method of claim 6, wherein determining an accumulated posterior probability for a parameter pair included in the first set of parameter pairs comprises:

for each feature vector x1, ..., xT1 assigned to the first GMM, calculating a posterior probability using the feature vector and said parameter pair; and

summing said calculated posterior probabilities.

8. The method of any one of claims 1-4, wherein using the data corresponding to the captured utterance U to obtain a set of feature vectors X comprises dividing the data into a plurality of signals s1, ..., sn corresponding to different frames in time and, for each signal si, calculating a plurality of parameter values that describe characteristics of the signal, wherein the corresponding vector, xi, consists of the plurality of parameter values.

9. The method of claim 8, wherein the plurality of parameter values consists of one or more values describing frequency content of the signal si, and one or more values describing energy content of the signal si.

10. The method of claim 4, wherein:

determining N1 based on T1 and n comprises multiplying T1/n by K1, rounded to the nearest whole number, wherein K1 is a total number of parameter pairs in the first set of parameter pairs, and

determining N2 based on T2 and n comprises multiplying T2/n by K2, rounded to the nearest whole number, wherein K2 is a total number of parameter pairs in the second set of parameter pairs.

11. An apparatus (102/108) for determining whether an utterance, U, was spoken by a hypothesized speaker, S, the apparatus comprising one or more processors (755/855) configured to:

obtain data corresponding to an utterance, U, captured by a microphone (104);

obtain a set of two or more Gaussian Mixture Models (GMMs), wherein said set of GMMs comprises a first GMM, λΑ, comprising a first set of parameter pairs, each parameter pair included in the first set of parameter pairs being associated with a weight and consisting of a mean vector and a corresponding covariance matrix, and wherein said set of GMMs further comprises a second GMM, λΒ, comprising a second set of parameter pairs, each parameter pair included in the second set of parameter pairs being associated with a weight and consisting of a mean vector and a corresponding covariance matrix;

select, based on the obtained data corresponding to the captured utterance U, i) a subset of the first set of parameter pairs, thereby forming a first subset of parameter pairs, and ii) a subset of the second set of parameter pairs, thereby forming a second subset of parameter pairs; combine the first subset of parameter pairs with the second subset of parameter pairs to form a Universal Background Model (UBM), λUBM, such that the UBM, λUBM, comprises the first subset of parameter pairs and the second subset of parameter pairs;

obtain a GMM model, λS, associated with the hypothesized speaker, S; and

use λS, λUBM, and the obtained data corresponding to the captured utterance U to determine whether U was spoken by S.

12. The apparatus (102/108) of claim 11, wherein the first GMM is a first speaker-independent GMM trained with speech samples from a first set of speakers.

13. The apparatus (102/108) of claim 11 or 12, wherein the second GMM is a second speaker-independent GMM trained with speech samples from a second set of speakers.

14. The apparatus (102/108) of any one of claims 11-13, wherein the one or more processors (755/855) are configured to select the parameter pair subsets based on the data corresponding to the captured utterance by:

using the data to obtain a set of feature vectors, X, said set of feature vectors comprising n feature vectors, x1, ..., xn, wherein n > 1;

for each feature vector x1, ..., xn, assigning the feature vector to one and only one of the GMMs included in said set of GMMs;

determining T1, wherein T1 is the total number of feature vectors assigned to the first GMM;

determining T2, wherein T2 is the total number of feature vectors assigned to the second GMM;

determining a value N1 based on T1 and n;

determining a value N2 based on T2 and n;

selecting N1 parameter pairs from the first set of parameter pairs; and

selecting N2 parameter pairs from the second set of parameter pairs, wherein λUBM comprises said N1 parameter pairs from the first set of parameter pairs and said N2 parameter pairs from the second set of parameter pairs.

15. The apparatus (102/108) of any one of claims 11-13, wherein the data corresponding to the captured utterance comprises a set of feature vectors, X, said set of feature vectors comprising n feature vectors, x1, ..., xn, wherein n > 1, wherein the one or more processors are configured to select the parameter pair subsets based on the data corresponding to the captured utterance by:

for each feature vector x1, ..., xn, assigning the feature vector to one and only one of the GMMs included in said set of GMMs;

determining T1, wherein T1 is the total number of feature vectors assigned to the first GMM;

determining T2, wherein T2 is the total number of feature vectors assigned to the second GMM;

determining a value N1 based on T1 and n;

determining a value N2 based on T2 and n;

selecting N1 parameter pairs from the first set of parameter pairs; and

selecting N2 parameter pairs from the second set of parameter pairs, wherein λUBM comprises said N1 parameter pairs from the first set of parameter pairs and said N2 parameter pairs from the second set of parameter pairs.

16. The apparatus (102/108) of claim 14 or 15, wherein the one or more processors (755/855) are configured to select the N1 parameter pairs from the first set of parameter pairs by: for each parameter pair included in the first set of parameter pairs, determining an accumulated posterior probability;

selecting from the first set of parameter pairs the N1 parameter pairs having the highest accumulated posterior probability.

17. The apparatus (102/108) of claim 16, wherein the one or more processors (755/855) are configured to determine an accumulated posterior probability for a parameter pair included in the first set of parameter pairs by:

for each feature vector x1, ..., xT1 assigned to the first GMM, calculating a posterior probability using the feature vector and said parameter pair; and

summing said calculated posterior probabilities.

18. The apparatus (102/108) of any one of claims 11-14, wherein the one or more processors (755/855) are configured to use the data corresponding to the captured utterance U to obtain a set of feature vectors X by dividing the data into a plurality of signals s1, ..., sn corresponding to different frames in time and, for each signal si, calculating a plurality of parameter values that describe characteristics of the signal, wherein the corresponding vector, xi, consists of the plurality of parameter values.

19. The apparatus (102/108) of claim 18, wherein the plurality of parameter values consists of one or more values describing frequency content of the signal si, and one or more values describing energy content of the signal si.

20. The apparatus (102/108) of claim 14, wherein:

the one or more processors (755/855) are configured to determine N1 based on T1 and n by multiplying T1/n by K1, rounded to the nearest whole number, wherein K1 is a total number of parameter pairs in the first set of parameter pairs, and

the one or more processors (755/855) are configured to determine N2 based on T2 and n by multiplying T2/n by K2, rounded to the nearest whole number, wherein K2 is a total number of parameter pairs in the second set of parameter pairs.

21. A computer program product (741/841) comprising a non-transitory computer-readable medium (742/842) storing a computer program for determining whether an utterance, U, was spoken by a hypothesized speaker, S, the computer program comprising computer-readable instructions (744/844) which, when executed by one or more processors (755/855), cause the one or more processors to carry out the machine-implemented method of claim 1.

22. A non-transitory computer-readable medium (742/842) storing computer-readable instructions (744/844) for a computer program that determines whether an utterance, U, was spoken by a hypothesized speaker, S, the computer-readable instructions (744/844) being adapted to cause one or more processors (755/855) to carry out the machine-implemented method of claim 1.

Description:
SYSTEM, APPARATUS, AND METHOD FOR PERFORMING SPEAKER VERIFICATION USING A UNIVERSAL BACKGROUND MODEL

TECHNICAL FIELD

[001] The present disclosure relates to performing speaker verification using a universal background model.

BACKGROUND

[002] Machine-based speaker verification may be used to verify the claimed identity of a speaker (e.g., an enrolled speaker in a voice verification program). The enrolled speaker may also be referred to as a target speaker. One way to formulate this problem is through a two-class hypothesis test, involving H0 and H1. H0 is the hypothesis that a particular speaker is the target speaker. H1 is the hypothesis that the particular speaker is not the target speaker. The machine may apply different models to information derived from the speaker's speech (or other utterance) to determine which hypothesis to accept.

[003] The models may include a first model, denoted as λenr or λS, that represents statistics of an enrolled speaker's speech or other utterance and is trained using voice samples (e.g., speech samples) from the enrolled speaker. The training may be done using Xenr, which may denote the total feature space (e.g., a large number of D-dimensional feature vectors) of an enrolled speaker available for offline training. The models may further include a second model, λUBM, that represents statistics of the space of imposter speakers. This second model may be referred to as a Universal Background Model (UBM), because it may be a single mixture model that captures the total space of imposter speakers.

[004] After an utterance U from a speaker is captured, feature vectors may be extracted from the utterance. The feature vectors may, e.g., include a total of n vectors, and may be denoted as the feature sequence x = [x1, x2, ..., xn]. Each vector xi may have a dimension of D. At the verification stage, H0 and H1 are tested with the feature sequence x. This is done by calculating the log-likelihood of x, given the models λ, to construct:

Λ(x) = log(p(x | λenr)) − log(p(x | λUBM))

(Equation 1)

[005] The log-likelihood distance Λ measures how much better an enrolled speaker model scores for a captured utterance compared to an imposter's model. In one example, if Λ(x) > Θ, H0 may be accepted. Otherwise, H1 may be accepted. In this instance, Θ is an offline-optimized threshold level.
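The decision rule of Equation 1 and paragraph [005] can be sketched in a few lines. This is an illustration only; the function name and the placeholder log-likelihood scores below are our assumptions, not part of the application:

```python
def verify(loglik_enrolled, loglik_ubm, theta):
    # Equation 1: log-likelihood distance between the enrolled-speaker
    # model score and the UBM (imposter) model score.
    ratio = loglik_enrolled - loglik_ubm
    # Accept H0 (utterance spoken by the target speaker) when the enrolled
    # model beats the imposter model by more than the tuned threshold.
    return "H0" if ratio > theta else "H1"

# Hypothetical scores: the enrolled model fits the utterance much better.
print(verify(-1040.2, -1103.7, theta=5.0))  # -> H0
```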

[006] One approach for modeling distributions of feature space in text-independent speaker recognition applications is through Gaussian mixture models (GMMs). In that context, a model λ is a GMM that is a mixture of K Gaussian components that have respective weights, mean vectors, and covariance matrices. The GMM may be denoted as λ: {wk, μk, Σk}, k = 1, ..., K.

[007] In other words, λ may be a probability distribution that is modeled as a superposition of K components (e.g., K Gaussian density functions Φk), with weights wk. The log-likelihood of the feature sequence x = [x1, x2, ..., xn] for this model may be calculated with the equation below:

log(p(x | λ)) = Σi=1..n log( Σk=1..K wk Φk(xi) )

(Equation 2)

[008] The summation over the n feature vectors accumulates contributions from the individual feature vectors xi.

[009] The components Φk are determined by a mean vector μk and a covariance matrix Σk (the sigma symbol in the equation below denotes the covariance matrix, and not the summation operation):

Φk(x) = (1 / ((2π)^(D/2) |Σk|^(1/2))) exp( −(1/2) (x − μk)ᵀ Σk⁻¹ (x − μk) )

(Equation 3)
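Equations 2 and 3 can be realized directly in code. The sketch below is our illustration using NumPy (not code from the application): it evaluates the Gaussian density of each component in log space and combines the weighted components with a numerically stable log-sum-exp.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    # Equation 3: log of the Gaussian density Phi_k with mean mu_k
    # and covariance Sigma_k.
    D = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)  # Mahalanobis term
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)

def gmm_loglik(X, weights, means, covs):
    # Equation 2: sum over the n feature vectors of the log of the
    # weighted mixture density.
    total = 0.0
    for x in X:
        comp = np.array([np.log(w) + gaussian_logpdf(x, m, c)
                         for w, m, c in zip(weights, means, covs)])
        m_max = comp.max()
        total += m_max + np.log(np.exp(comp - m_max).sum())  # log-sum-exp
    return total
```

For a single standard-normal component in two dimensions evaluated at the origin, this reduces to −log(2π), which is a convenient sanity check.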

[0010] A GMM λenr for an enrolled speaker can be considered a model of the underlying broad phonetic sounds that characterize the speaker's voice. It may have, e.g., 64 components, and may be trained on the available audio data for each particular speaker.

[0011] A GMM λUBM for the space of imposter speakers captures underlying sound classes in speech, and it may be more complex (e.g., it may have more than 1024 components). This UBM may be trained by pooling speech from a large number of speakers, such as speakers of different genders and age groups, speakers recorded under different conditions, and speakers of different languages or dialects.

[0012] Context-independent speaker verification often needs to work across largely diverse situations, such as variations in recording conditions (e.g., different recording equipment or bandwidth), the language of the speaker's utterance, the gender of the speaker, etc.

[0013] The diversity of these situations can result in a UBM that covers a very wide feature space, which, in turn, can impact the log-likelihood log(p(x | λUBM)) calculated in Equation 2 and make the tuning of the threshold Θ for Λ in Equation 1 difficult.

[0014] To have a more stable UBM, classified UBMs are sometimes used. In this approach, several UBMs λ1, ..., λN are trained for different classes of signals. Then, in the speaker verification stage, the signal class is recognized first from a voice signal and the corresponding UBM is used in verifying the voice signal. For example, one UBM might be trained for carbon-button-microphone-recorded speech and one for electret-microphone-handset speech. In the speaker verification stage, it is first determined whether the input voice signal matches carbon-button-microphone-recorded speech or electret-microphone-recorded speech. The appropriate UBM, depending on the preceding classification step, is then used for speaker verification.

SUMMARY

[0015] The present disclosure is concerned with performing speaker verification by forming a universal background model (UBM) based on a captured utterance made by the speaker. For example, in some embodiments, a UBM is formed by combining a portion of one GMM with a portion of another GMM based on the captured utterance (e.g., based on feature vectors created from the captured utterance). Each of the first GMM and second GMM may have been trained offline for a particular class of signals, such as for a particular speaker gender, language (e.g., English, French, or dialects thereof), or recording condition. While each of these GMMs may itself be usable as a classified UBM, a limitation of using classified UBMs is that they make a hard split of the test utterance space. This can lead to problems when there is a classification error, or when the test utterance does not precisely match one of the classes. In one example, a classification error may occur when a test utterance captured with a carbon-button microphone is incorrectly identified as having been captured with an electret microphone, leading to the sub-optimal selection of a UBM-GMM trained on electret microphone input. Further, if the test utterance was captured using a new type of microphone, or in an otherwise unforeseen recording condition, it may not match any preexisting (e.g., pre-stored) GMM. As another example, if the GMMs were all trained using native speakers of various languages, a test utterance from a non-native speaker of one of the languages may also not match any of the GMMs.

[0016] To address the above limitations of classified UBMs, this disclosure concerns the formation of a new GMM from preexisting GMMs, and the use of the new GMM as the UBM for the speaker verification process. In some embodiments, it concerns the formation of a new GMM in a data-dependent manner, so that an utterance captured from a user is used to select which portions of the preexisting GMMs to use in dynamically generating a new GMM as the UBM. This data-dependent fusion of GMMs can better adapt the newly formed UBM to the content of the captured utterance, and is more robust against classification errors. In an embodiment, this approach avoids the need to make a hard decision between different models (as in decision-tree methods) or to combine the scores of different models (as in committee or mixture-of-experts methods).

[0017] One aspect of the present disclosure thus presents a machine-implemented method for determining whether an utterance, U, was spoken by a hypothesized speaker, S. The method comprises obtaining data corresponding to an utterance, U, captured by a microphone. It further comprises obtaining a set of two or more Gaussian Mixture Models (GMMs). The set of GMMs comprises a first GMM, λΑ, comprising a first set of parameter pairs. Each parameter pair included in the first set of parameter pairs is associated with a weight and consists of a mean vector and a corresponding covariance matrix. The set of GMMs further comprises a second GMM, λΒ, comprising a second set of parameter pairs. Each parameter pair included in the second set of parameter pairs is associated with a weight and consists of a mean vector and a corresponding covariance matrix. The method further comprises selecting, based on the data corresponding to the captured utterance, U, i) a subset of the first set of parameter pairs, thereby forming a first subset of parameter pairs, and ii) a subset of the second set of parameter pairs, thereby forming a second subset of parameter pairs. The first subset of parameter pairs is combined with the second subset of parameter pairs to form a Universal Background Model (UBM), λUBM, such that the UBM, λUBM, comprises the first subset of parameter pairs and the second subset of parameter pairs. The method further comprises obtaining a GMM model, λS, associated with the hypothesized speaker, S. The method further comprises using λS, λUBM, and the data corresponding to the captured utterance, U, to determine whether U was spoken by S.

[0018] In some implementations, the first GMM is a first speaker-independent GMM trained with speech samples from a first set of speakers.

[0019] In some implementations, the second GMM is a second speaker-independent GMM trained with speech samples from a second set of speakers.

[0020] In some implementations, the step of selecting the parameter pair subsets based on the data corresponding to the captured utterance comprises: i) using (402) the data to obtain a set of feature vectors, X, said set of feature vectors comprising n feature vectors, x1, ..., xn, wherein n > 1; ii) for each feature vector x1, ..., xn, assigning (404) the feature vector to one and only one of the GMMs included in said set of GMMs; iii) determining (406) T1, where T1 is the total number of feature vectors assigned to the first GMM; iv) determining (408) T2, where T2 is the total number of feature vectors assigned to the second GMM; v) determining (410) a value N1 based on T1 and n; vi) determining (412) a value N2 based on T2 and n; vii) selecting (414) N1 parameter pairs from the first set of parameter pairs; and viii) selecting (416) N2 parameter pairs from the second set of parameter pairs, where λUBM comprises said N1 parameter pairs from the first set of parameter pairs and said N2 parameter pairs from the second set of parameter pairs.
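The counting and budgeting steps above (assign each feature vector to exactly one GMM, count the assignments T1 and T2, then size the per-GMM selections N1 and N2) can be sketched as follows. The per-vector scoring functions used for the assignment are our assumption; the method only requires that each vector go to one and only one GMM.

```python
import numpy as np

def assign_and_budget(X, score_A, score_B, K1, K2):
    # Assignment step: each feature vector goes to the GMM under which
    # it scores higher (one and only one GMM per vector).
    assigned = np.array([0 if score_A(x) >= score_B(x) else 1 for x in X])
    n = len(X)
    # T1, T2: counts of vectors assigned to the first / second GMM.
    T1 = int((assigned == 0).sum())
    T2 = int((assigned == 1).sum())
    # N1, N2: component budgets proportional to the counts; one natural
    # choice, matching the rounding rule described later for claim 10.
    N1 = round(T1 / n * K1)
    N2 = round(T2 / n * K2)
    return assigned, T1, T2, N1, N2
```

With four vectors split evenly between two 1024-component GMMs, this yields T1 = T2 = 2 and N1 = N2 = 512.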

[0021] In some implementations, the data corresponding to the captured utterance comprises a set of feature vectors, X, said set of feature vectors comprising n feature vectors, x1, ..., xn, wherein n > 1, and the step of selecting the parameter pair subsets based on the data corresponding to the captured utterance comprises: i) for each feature vector x1, ..., xn, assigning the feature vector to one and only one of the GMMs included in said set of GMMs; ii) determining T1, where T1 is the total number of feature vectors assigned to the first GMM; iii) determining T2, where T2 is the total number of feature vectors assigned to the second GMM; iv) determining a value N1 based on T1 and n; v) determining a value N2 based on T2 and n; vi) selecting N1 parameter pairs from the first set of parameter pairs; and vii) selecting N2 parameter pairs from the second set of parameter pairs, where λUBM comprises said N1 parameter pairs from the first set of parameter pairs and said N2 parameter pairs from the second set of parameter pairs.

[0022] In some implementations, selecting the N1 parameter pairs from the first set of parameter pairs comprises: i) for each parameter pair included in the first set of parameter pairs, determining an accumulated posterior probability; ii) selecting from the first set of parameter pairs the N1 parameter pairs having the highest accumulated posterior probability.

[0023] In some implementations, determining an accumulated posterior probability for a parameter pair included in the first set of parameter pairs comprises: i) for each feature vector x1, ..., xT1 assigned to the first GMM, calculating a posterior probability using the feature vector and said parameter pair; and ii) summing said calculated posterior probabilities.
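The accumulated posterior probability described above can be sketched as below. This is our illustration: `logpdfs` stands in for the per-component densities Φk of the first GMM, and the posterior of component k for a vector x is wk Φk(x) normalized over all components.

```python
import numpy as np

def accumulated_posteriors(X_assigned, weights, logpdfs):
    # For each component k, accumulate over the assigned feature vectors
    # the posterior probability w_k * Phi_k(x) / sum_j w_j * Phi_j(x).
    acc = np.zeros(len(weights))
    for x in X_assigned:
        logp = np.log(weights) + np.array([f(x) for f in logpdfs])
        p = np.exp(logp - logp.max())  # numerically stable normalization
        acc += p / p.sum()
    return acc

def select_top(acc, N1):
    # Keep the N1 components with the highest accumulated posterior.
    return np.argsort(acc)[::-1][:N1]
```

For vectors clustered near one component's mean, that component accumulates nearly all of the posterior mass and is selected first.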

[0024] In some implementations, using the data corresponding to the captured utterance U to obtain a set of feature vectors X comprises i) dividing the data into a plurality of signals s1, ..., sn corresponding to different frames in time and, ii) for each signal si, calculating a plurality of parameter values that describe characteristics of the signal, wherein the corresponding vector, xi, consists of the plurality of parameter values.

[0025] In some implementations, the plurality of parameter values consists of one or more values describing frequency content of the signal si, and one or more values describing energy content of the signal si.
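A minimal feature extractor along these lines (frame the signal, then compute one energy value and one coarse frequency descriptor per frame) might look like this. The frame length and the spectral-centroid choice are illustrative assumptions, not the application's specification; a practical system would typically use richer features such as MFCCs.

```python
import numpy as np

def extract_features(samples, frame_len=160):
    # Divide the data into signals s_1, ..., s_n (frames in time).
    n = len(samples) // frame_len
    feats = []
    for i in range(n):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        # One value describing energy content of the frame.
        log_energy = np.log(np.sum(frame ** 2) + 1e-12)
        # One value describing frequency content: the spectral centroid.
        spec = np.abs(np.fft.rfft(frame))
        centroid = float(np.sum(np.arange(len(spec)) * spec) / (np.sum(spec) + 1e-12))
        feats.append([log_energy, centroid])
    return np.array(feats)  # each row is one feature vector x_i
```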

[0026] In some implementations, determining N1 based on T1 and n comprises multiplying T1/n by K1, rounded to the nearest whole number, where K1 is a total number of parameter pairs in the first set of parameter pairs. Further, determining N2 based on T2 and n comprises multiplying T2/n by K2, rounded to the nearest whole number, where K2 is a total number of parameter pairs in the second set of parameter pairs.
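With illustrative numbers (ours, not from the application): suppose n = 200 feature vectors, of which T1 = 150 are assigned to the first GMM (K1 = 1024 parameter pairs) and T2 = 50 to the second (K2 = 512 parameter pairs).

```python
n, T1, T2 = 200, 150, 50
K1, K2 = 1024, 512
N1 = round(T1 / n * K1)  # 0.75 * 1024 = 768 pairs kept from the first GMM
N2 = round(T2 / n * K2)  # 0.25 * 512  = 128 pairs kept from the second GMM
print(N1, N2)  # -> 768 128
```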

[0027] One aspect of the present disclosure thus presents an apparatus (102/108) for determining whether an utterance, U, was spoken by a hypothesized speaker, S. The apparatus comprises one or more processors (755/855). The one or more processors are configured to obtain data corresponding to an utterance, U, captured by a microphone (104). They are further configured to obtain a set of two or more Gaussian Mixture Models (GMMs). The set of GMMs comprises a first GMM, λΑ, comprising a first set of parameter pairs, each parameter pair included in the first set of parameter pairs being associated with a weight and consisting of a mean vector and a corresponding covariance matrix. The set of GMMs further comprises a second GMM, λΒ, comprising a second set of parameter pairs, each parameter pair included in the second set of parameter pairs being associated with a weight and consisting of a mean vector and a corresponding covariance matrix. The one or more processors are configured to select, based on the obtained data corresponding to the captured utterance U, i) a subset of the first set of parameter pairs, thereby forming a first subset of parameter pairs, and ii) a subset of the second set of parameter pairs, thereby forming a second subset of parameter pairs. They are further configured to combine the first subset of parameter pairs with the second subset of parameter pairs to form a Universal Background Model (UBM), λUBM, such that the UBM, λUBM, comprises the first subset of parameter pairs and the second subset of parameter pairs. The one or more processors are additionally configured to obtain a GMM model, λS, associated with the hypothesized speaker, S; and to use λS, λUBM, and the obtained data corresponding to the captured utterance U to determine whether U was spoken by S.

[0028] In some implementations, the first GMM is a first speaker-independent GMM trained with speech samples from a first set of speakers.

[0029] In some implementations, the second GMM is a second speaker-independent GMM trained with speech samples from a second set of speakers.

[0030] In some implementations, the one or more processors (755/855) are configured to select the parameter pair subsets based on the data corresponding to the captured utterance by: i) using the data to obtain a set of feature vectors, X, said set of feature vectors comprising n feature vectors, x1, ..., xn, wherein n > 1; ii) for each feature vector x1, ..., xn, assigning the feature vector to one and only one of the GMMs included in said set of GMMs; iii) determining T1, wherein T1 is the total number of feature vectors assigned to the first GMM; iv) determining T2, wherein T2 is the total number of feature vectors assigned to the second GMM; v) determining a value N1 based on T1 and n; vi) determining a value N2 based on T2 and n; vii) selecting N1 parameter pairs from the first set of parameter pairs; and viii) selecting N2 parameter pairs from the second set of parameter pairs, wherein λUBM comprises said N1 parameter pairs from the first set of parameter pairs and said N2 parameter pairs from the second set of parameter pairs.

[0031] In some implementations, the data corresponding to the captured utterance comprises a set of feature vectors, X, said set of feature vectors comprising n feature vectors, x1, ..., xn, wherein n > 1, wherein the one or more processors are configured to select the parameter pair subsets based on the data corresponding to the captured utterance by: i) for each feature vector x1, ..., xn, assigning the feature vector to one and only one of the GMMs included in said set of GMMs; ii) determining T1, wherein T1 is the total number of feature vectors assigned to the first GMM; iii) determining T2, wherein T2 is the total number of feature vectors assigned to the second GMM; iv) determining a value N1 based on T1 and n; v) determining a value N2 based on T2 and n; vi) selecting N1 parameter pairs from the first set of parameter pairs; and vii) selecting N2 parameter pairs from the second set of parameter pairs, wherein λUBM comprises said N1 parameter pairs from the first set of parameter pairs and said N2 parameter pairs from the second set of parameter pairs.

[0032] In some implementations, the one or more processors (755/855) are configured to select the N1 parameter pairs from the first set of parameter pairs by: i) for each parameter pair included in the first set of parameter pairs, determining an accumulated posterior probability; and ii) selecting from the first set of parameter pairs the N1 parameter pairs having the highest accumulated posterior probability.

[0033] In some implementations, the one or more processors (755/855) are configured to determine an accumulated posterior probability for a parameter pair included in the first set of parameter pairs by: i) for each feature vector x1,...,xT1 assigned to the first GMM, calculating a posterior probability using the feature vector and said parameter pair; and ii) summing said calculated posterior probabilities.

[0034] In some implementations, the one or more processors (755/855) are configured to use the data corresponding to the captured utterance U to obtain a set of feature vectors X by i) dividing the data into a plurality of signals s1,...,sn corresponding to different frames in time and, ii) for each signal si, calculating a plurality of parameter values that describe characteristics of the signal, wherein the corresponding vector, xi, consists of the plurality of parameter values.

[0035] In some implementations, the plurality of parameter values consists of one or more values describing frequency content of the signal si, and one or more values describing energy content of the signal si.

[0036] In some implementations, the one or more processors (755/855) are configured to determine N1 based on T1 and n by multiplying T1/n by K1, rounded to the nearest whole number, wherein K1 is a total number of parameter pairs in the first set of parameter pairs, and the one or more processors (755/855) are configured to determine N2 based on T2 and n by multiplying T2/n by K2, rounded to the nearest whole number, wherein K2 is a total number of parameter pairs in the second set of parameter pairs.
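By way of a non-limiting illustration (not part of the claimed subject matter), the rounding rule of paragraph [0036] can be sketched as follows; the function name is an assumption made for the example:

```python
# Sketch of the rule N_i = round(K_i * T_i / n) described in [0036].
# The function name is illustrative, not taken from the disclosure.

def component_count(k_total, t_assigned, n_total):
    """Number of parameter pairs to take from a GMM with k_total
    components when t_assigned of the n_total feature vectors were
    assigned to it, rounded to the nearest whole number."""
    return round(k_total * t_assigned / n_total)

n1 = component_count(4, 3, 4)  # K1=4, T1=3, n=4
n2 = component_count(4, 1, 4)  # K2=4, T2=1, n=4
```

With K1 = K2 = 4 and 3 of 4 vectors assigned to the first GMM, this reproduces the N1 = 3, N2 = 1 split used in the worked example later in the description.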

[0037] One aspect of the present disclosure presents a computer program product (741/841) comprising a non-transitory computer-readable medium (742/842) storing a computer program for determining whether an utterance, U, was spoken by a hypothesized speaker, S. The computer program comprises computer-readable instructions (744/844) which, when executed by one or more processors (755/855), cause the one or more processors to carry out the machine-implemented method described above.

[0038] One aspect of the present disclosure presents a non-transitory computer-readable medium (742/842) storing computer-readable instructions (744/844) for a computer program that determines whether an utterance, U, was spoken by a hypothesized speaker, S. The computer-readable instructions (744/844) are adapted to cause one or more processors (755/855) to carry out the machine-implemented method described above.

[0039] These and other aspects and embodiments are further described herein.

BRIEF DESCRIPTION OF THE DRAWINGS

[0040] The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments of the present invention and, together with the description, further serve to explain the principles of the invention and to enable a person skilled in the pertinent art to make and use the invention. In the drawings, like reference numbers indicate identical or functionally similar elements.

[0041] FIG. 1 illustrates a speaker verification system.

[0042] FIG. 2 illustrates a speaker verification module.

[0043] FIGS. 3-6 show flow charts illustrating various processes according to some embodiments of the present disclosure.

[0044] FIG. 7 is a functional block diagram of a client device, according to an embodiment of the present disclosure.

[0045] FIG. 8 is a functional block diagram of a server, according to an embodiment of the present disclosure.

[0046] FIG. 9 illustrates functionality of the speaker verification module.

DETAILED DESCRIPTION

[0047] Referring now to FIG. 1, a speaker verification system 100 is illustrated for verifying whether a given test utterance, U, captured from a user 101 was spoken by a hypothesized speaker, S, such as a subscriber or enrolled person whom user 101 claims to be. The system 100 includes a microphone 104 for capturing the test utterance U from user 101 and a speaker verification module for determining whether the segment of speech matches the hypothesized speaker S. In an embodiment, the microphone 104 may be integrated into a client device 102, such as a mobile communication device (e.g., smartphone or laptop) or desktop computer. In an embodiment, the microphone 104 may be a standalone component that communicates with the device 102 via a connector or a wireless connection. The verification module may be implemented solely as a verification module 110a on client device 102, implemented solely as a verification module 110b on a remote server 108, implemented on both the client device 102 and the remote server 108, or on some other device. The client device 102 and server 108 may communicate through a network 106. If the speaker verification module were implemented solely on server 108, the microphone 104 may, in an embodiment, be a standalone component with a transceiver that communicates with server 108 through network 106. In this embodiment, the client device 102 may be omitted.

[0048] The network may carry information needed to carry out any functionality of the speaker verification module implemented in module 110b on server 108. The information may include, e.g., data corresponding to an utterance, U, captured by the microphone 104. In some instances, this data may include values of a waveform of the utterance U itself. In some instances, the data may include parameter values derived from the captured utterance U, such as feature vectors x1, x2, ..., xn that represent characteristics of the captured utterance U.

[0049] FIG. 2 illustrates the speaker verification module 110a/110b using the input of an utterance U and a plurality of Gaussian mixture models (GMMs) (e.g., λA and λB) and outputting a new Gaussian mixture model, λUBM, that is used as the universal background model, UBM, for performing speaker verification on the utterance U.

[0050] Each of the GMMs may have been trained using a particular predefined class of speakers or predefined recording condition. For example, λA may be a GMM trained using a first set of speakers speaking only a first language (e.g., a first dialect), and λB may have been trained using a second set of speakers speaking only a second language. The two sets may have speakers in common, or may be exclusive. In another example, λA may have been trained using only speech samples captured by a carbon-button microphone, while λB may have been trained using only speech samples captured by an electret microphone handset. In yet another example, λA may have been trained using a large number of only male speakers, while λB may have been trained using a large number of only female speakers. In some embodiments, λA or λB may be trained using a combination of the above classes or conditions (e.g., λA is trained using only male speakers of the first language). The verification module 110a/110b is not limited to λA and λB as its input, but may use a larger number of pre-existing GMMs (e.g., λC, λD, λE, etc.) to form the UBM.

[0051] The GMMs λA and λB may be stored on device 102, server 108, or on some other device. The device may store other GMMs (λC, λD, etc.). In an embodiment, these GMMs may be pre-existing GMMs obtained from a database or other storage device. They may have been generated with various utterances (e.g., voice samples) during earlier periods of offline training. Each GMM may be stored as a set of parameter pairs, along with their respective weights, as discussed below.

[0052] In FIG. 2, each GMM includes a set of parameter pairs which are associated with different respective weights. Each parameter pair may consist of a mean vector μ and a covariance matrix Σ. The combination of the respective weights and parameter pairs for λA may be denoted as {wA,k, μA,k, ΣA,k} for k = 1,...,K. This indicates that the GMM λA consists of K parameter pairs of μ and Σ, indexed from k=1 to K, and their corresponding weights w. Each parameter pair may make up one of K components Φ of the GMM. The kth component, which may be denoted as ΦA,k, is a Gaussian density function defined by the pair of parameters: mean vector μA,k and covariance matrix ΣA,k. This component Φ outputs a value as a function of an input vector x. This function is a more specific version of Equation 3, and is illustrated below:

ΦA,k(xn) = exp{-(1/2) (xn - μA,k)^T ΣA,k^(-1) (xn - μA,k)} / ((2π)^(D/2) |ΣA,k|^(1/2))

Equation (3a)

[0053] In Equation 3a, the input vector is labeled xn to denote that an utterance may be used to calculate a plurality of (i.e., n) vectors which represent characteristics of the utterance. For instance, speaker verification module 110a/110b or some other module may process the test utterance U being captured or already captured by microphone 104 and divide the test utterance U into smaller portions, such as frames (e.g., into 10-millisecond frames). For each frame, the module may generate a vector xn having parameter values that represent voice characteristics (e.g., speech characteristics) of that frame, and possibly of neighboring frames. Example values include a maximum energy level in the frame, average energy level in the frame, logarithm of the energy level in the frame, fundamental frequency in the frame, logarithm of the fundamental frequency in the frame, and a first-order or second-order derivative of the frequency content or energy level in that frame over that frame and several of its adjacent frames. Different implementations may have different numbers of parameter values. In some implementations, the vector xn may have 4 parameter values (i.e., the dimensionality D of xn is 4). In some implementations, the vector xn may have tens, hundreds, or thousands of parameter values. The mean vector μA,k may be a Dx1 vector having the respective mean values for the parameters in the vector. The covariance matrix ΣA,k may be a DxD matrix for those parameters. Thus, the notation (xn - μA,k)^T in Equation 3a indicates a 1xD vector that is a transpose of the vector (xn - μA,k).
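By way of a non-limiting illustration, the component density of Equation (3a) can be sketched in code; the function name and the use of numpy are assumptions made for the example, not part of the disclosure:

```python
import numpy as np

# Illustrative sketch of the Gaussian component density of Equation (3a).
# Names (gaussian_density, mu, cov) are assumptions for illustration.

def gaussian_density(x, mu, cov):
    """Value of one GMM component Phi(x) for a D-dimensional feature
    vector x, mean vector mu of shape (D,) and covariance matrix cov
    of shape (D, D)."""
    d = x.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^(-1) (x - mu), computed via a
    # linear solve rather than an explicit matrix inverse.
    expo = -0.5 * diff @ np.linalg.solve(cov, diff)
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(cov))
    return float(np.exp(expo) / norm)
```

At the mean (x = μ) the exponential term is 1, so the value reduces to the normalization constant 1/((2π)^(D/2) |Σ|^(1/2)).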

[0054] The parameter pair for ΦA,k may be associated with a weight wA,k. FIG. 2 illustrates an example in which λA includes at least four components that each consist of a parameter pair. The kth component ΦA,k consists of the parameter pair μA,k and ΣA,k, and is associated with a weight wA,k. Similarly, FIG. 2 illustrates an example in which λB includes at least four components that each consist of a parameter pair. The component ΦB,k consists of the parameter pair μB,k and ΣB,k, and is associated with a weight wB,k.

[0055] During speaker verification, relying on only λA or on only λB as a universal background model (UBM) may be sub-optimal. For example, if λA and λB corresponded to two different pre-existing models trained with speech samples of two different languages, respectively, a classification error in the language being spoken by user 101 may lead λA to be selected as the UBM when the user was actually speaking the language associated with λB. As another example, if the GMMs λA, λB, etc. were trained for predefined classes, the speech of user 101 may not fit exactly into one of the predefined classes. For instance, λA may have been trained using only native speakers of a first language (e.g., English) and λB may have been trained using only native speakers of a second language (e.g., French). The user 101 at the voice verification stage, however, may be a non-native speaker of the first language (e.g., a French person speaking English). In this instance, the test utterance U may not fit exactly into either the class for λA (e.g., native English speakers) or the class for λB (e.g., native French speakers). In another instance, λA, λB, etc. may have been trained using respective known types of microphones and thus correspond to those predefined recording conditions, while the user 101's voice is being or was captured using a new type of microphone 104 or in an otherwise unforeseen recording condition. The captured test utterance U may again not fit exactly into any of the predefined classes. Thus, using a pre-existing GMM for a predefined class, such as using solely λA or solely λB as the UBM for speaker verification, may be sub-optimal.

[0056] The UBM in FIG. 2 is a GMM λUBM which combines parameter pairs from a pre-existing GMM λA for a first predefined class and a pre-existing GMM λB for a second predefined class. In this example, λUBM combines at least three parameter pairs from λA and at least one parameter pair from λB. In an embodiment, the generated UBM may have the same number of components or parameter pairs as any of the pre-existing GMMs. If the pre-existing GMMs λA, λB, etc. each include exactly K parameter pairs, the UBM may also include exactly K parameter pairs. The value of K may vary for different embodiments, and may range from several parameter pairs (e.g., 4) to thousands or tens of thousands of parameter pairs, or have some other range. Each of the parameter pairs in the UBM may be associated with a weight wUBM,1, wUBM,2, wUBM,3, wUBM,4, etc. As explained in more detail below, the weights may be normalized versions of the weights wA,1, wA,2, wA,3, wB,2 of the parameter pairs selected for inclusion in the UBM. The normalization may be performed so that the weights for the UBM λUBM add to 1. Also as explained in more detail below, the verification module 110a/110b may use the UBM λUBM to verify whether the test utterance U from user 101 matches a hypothesized speaker S.
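A non-limiting sketch of this pooling step follows; representing each component as a (weight, mean, covariance) tuple is an assumption made for the example, not the claimed representation:

```python
# Illustrative sketch (not part of the claimed embodiments) of pooling the
# selected parameter pairs into lambda_UBM and renormalizing their original
# weights so they sum to 1.

def pool_ubm(selected_a, selected_b):
    """selected_a / selected_b: lists of (weight, mean, cov) tuples chosen
    from the pre-existing GMMs lambda_A and lambda_B."""
    pooled = list(selected_a) + list(selected_b)
    total = sum(w for w, _, _ in pooled)          # normalization factor
    return [(w / total, mu, cov) for w, mu, cov in pooled]

# Three components taken from lambda_A and one from lambda_B (1-D toy values).
ubm = pool_ubm([(0.3, 0.0, 1.0), (0.2, 1.0, 1.0), (0.1, 2.0, 1.0)],
               [(0.2, 5.0, 1.0)])
```

Here the four selected weights sum to 0.8, so each is divided by 0.8 and the pooled weights of λUBM sum to 1.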

[0057] FIG. 3 provides a flow diagram which illustrates an example machine-implemented method 300 for determining whether a given utterance, U, was spoken by a hypothesized speaker, S. In an embodiment, the method begins at step 302, in which a module (e.g., module 110a on client device 102 and/or module 110b on server 108) obtains data corresponding to an utterance, U, captured by a microphone (e.g., a standalone microphone device, or a microphone 104 integrated into device 102) as a given segment of speech, Y. If the step is performed at least in part by module 110b on server 108, the module 110b may communicate with the microphone via network 106. In an embodiment, the segment of speech, Y, is obtained from only a single user. In some cases, the obtained data may be values of a waveform representing the utterance U, and may be used to determine a set of feature vectors x1, ..., xn. In some cases, the obtained data may be the feature vectors themselves. This may occur, for instance, when the server 108 obtains feature vectors which had been generated by the client device 102 from an utterance U.

[0058] In step 304, the module 110a/110b may obtain a set of two or more Gaussian Mixture Models (GMMs). The GMMs may comprise a first GMM, λA, comprising a first set of parameter pairs. Each parameter pair included in the first set of parameter pairs may be associated with a weight and consist of a mean vector and a corresponding covariance matrix. The set of GMMs may further comprise a second GMM, λB, comprising a second set of parameter pairs. Each parameter pair included in the second set of parameter pairs may be associated with a weight and consist of a mean vector and a corresponding covariance matrix. In an embodiment, the GMMs may be obtained from a local or remote storage device, such as a storage device on device 102 or server 108. In an embodiment, the obtained set of GMMs may be pre-existing GMMs that were trained for predefined classes of speakers or predefined recording conditions. In one example, the first GMM λA is a first speaker-independent GMM trained with utterance samples (e.g., speech samples) from a first set of speakers (e.g., male speakers or a set of speakers using a first type of microphone, such as a carbon button microphone). In this example, the second GMM λB may be a second speaker-independent GMM trained with utterance samples from a second set of speakers (e.g., female speakers or a set of speakers using a second type of microphone, such as an electret microphone). A speaker-independent GMM may be trained with speakers in a particular class or using a particular recording condition, and be able to recognize whether a speaker belongs to that class or recording condition even if that speaker was not one of those used to train the speaker-independent GMM.

[0059] In step 306, the module may select, based on the obtained data corresponding to the captured utterance, U, i) a subset of the first set of parameter pairs, thereby forming a first subset of parameter pairs, and ii) a subset of the second set of parameter pairs, thereby forming a second subset of parameter pairs. For instance, the module may select 75% of the parameter pairs in λA (and not the other 25% of the parameter pairs in λA) and select 25% of the parameter pairs in λB (and not the other 75% of the parameter pairs in λB).

[0060] In step 308, the module may combine the first subset of parameter pairs with the second subset of parameter pairs to form a Universal Background Model (UBM), λUBM, such that the UBM, λUBM, comprises the first subset of parameter pairs and the second subset of parameter pairs (e.g., comprises the selected 75% of the parameter pairs in λA and the selected 25% of the parameter pairs in λB).

[0061] In step 310, the module obtains a GMM model, λS, associated with the hypothesized speaker, S. The GMM λS may be a speaker-dependent GMM trained using utterance samples from only the hypothesized speaker, and may also include a plurality of parameter pairs with associated weights. In an embodiment, λS may contain considerably fewer parameter pairs than λA or λB (e.g., 64 parameter pairs versus 1024 parameter pairs).

[0062] In step 312, the module uses λS, λUBM, and the obtained data corresponding to U to determine whether U was spoken by the hypothesized speaker S. As discussed in more detail below, this step may include using one or more vectors xn extracted from U and, for each vector, calculating the logarithm of p(xn|λS) and the logarithm of p(xn|λUBM). The logarithms may be added across the one or more vectors xn to yield a log-likelihood score for λS and a log-likelihood score for λUBM. If the difference between these two log-likelihood scores crosses an optimized threshold level Θ, the module may output an indication that utterance U from user 101 was spoken by the hypothesized speaker S.
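A non-limiting 1-D sketch of this decision follows; the scalar variances, the toy model values, and the threshold value are all assumptions made for the example:

```python
import math

# Illustrative sketch of the decision in step 312: score the utterance
# under lambda_S and lambda_UBM and compare the difference of the two
# log-likelihood scores with a threshold theta.

def log_gmm(x, weights, means, variances):
    """log p(x | lambda) for a 1-D GMM (cf. Equation 2)."""
    total = sum(w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances))
    return math.log(total)

def verify(xs, gmm_s, gmm_ubm, theta=0.0):
    """True if the summed log-likelihood ratio exceeds the threshold."""
    score = sum(log_gmm(x, *gmm_s) - log_gmm(x, *gmm_ubm) for x in xs)
    return score > theta

# Hypothesized speaker model centred at 0; background model centred at 5.
accepted = verify([0.1, -0.2], ([1.0], [0.0], [1.0]), ([1.0], [5.0], [1.0]))
```

Vectors close to the speaker model's mean drive the score above the threshold; vectors close to the background model's mean drive it below.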

[0063] A more specific example of the selection of the subsets of the parameter pairs in step 306 is illustrated in FIG. 4. In this example, this selection includes steps 402-416. In step 402, the module uses the data corresponding to U (from step 302) to obtain a set of feature vectors, X. For this step, the data corresponding to U may be values of a waveform that represents the captured utterance U. The set X of feature vectors comprises n feature vectors, x1,...,xn, wherein n > 1. The discussion below includes an example in which n=4. In some applications, n may be in the range of hundreds to tens of thousands of feature vectors, or some other range. In some cases, n may depend on an overall speaking rate of the utterance being captured by speech segment Y. Each feature vector may correspond to, for instance, a different frame (e.g., a different 20 ms frame) of speech segment Y. The frames may overlap, or be non-overlapping. The feature vector for a frame may include parameter values that characterize voice characteristics (e.g., speech characteristics) of the frame, such as values on the energy content or frequency content of the frame, or the derivative of such values.

[0064] In some embodiments, using the obtained data corresponding to U to obtain a set of feature vectors X comprises dividing the data into a plurality of signals s1,...,sn corresponding to different frames in time and, for each signal si, calculating a plurality of parameter values that describe characteristics of the signal, wherein the corresponding vector, xi, consists of the plurality of parameter values. In some embodiments, the plurality of parameter values consists of one or more values describing frequency content of the signal si, and one or more values describing energy content of the signal si.
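A non-limiting sketch of such framing follows; the frame length and the particular pair of parameter values (a log-energy value and a zero-crossing count as a crude frequency measure) are assumptions made for the example only:

```python
import numpy as np

# Illustrative sketch of paragraph [0064]: divide a waveform into frames
# s1,...,sn and compute, per frame, one energy value and one crude
# frequency value.

def feature_vectors(waveform, frame_len=160):
    n_frames = len(waveform) // frame_len
    feats = []
    for i in range(n_frames):
        s = waveform[i * frame_len:(i + 1) * frame_len]
        log_energy = np.log(np.sum(s ** 2) + 1e-12)            # energy content
        zero_crossings = float(np.sum(np.abs(np.diff(np.sign(s))) > 0))
        feats.append(np.array([log_energy, zero_crossings]))   # vector x_i
    return feats

# A 1600-sample sine wave yields 10 non-overlapping 160-sample frames.
fv = feature_vectors(np.sin(np.linspace(0.0, 2.0 * np.pi * 10, 1600)))
```

Each frame thus yields a 2-dimensional vector xi; a practical implementation would typically use more parameter values per frame, as paragraph [0053] notes.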

[0065] In some embodiments, step 402 may be omitted, because the obtained data corresponding to the captured utterance in step 302 may already comprise a set of feature vectors, x1, ..., xn, where n > 1. The feature vectors may each include parameter values which characterize voice characteristics of a respective frame of the captured utterance U. In one example, such feature vectors may have been generated by client device 102 and obtained by module 110b on server 108.

[0066] In step 404, for each feature vector x1,...,xn, the module assigns the feature vector to one and only one of the GMMs included in said set of GMMs. For example, it may assign 75% of the feature vectors to GMM λA, and 25% of the feature vectors to λB. In some implementations, the module may assign a vector xn to a GMM representing a class of speakers or recording conditions that most closely matches xn. More specifically, it may assign vector xn to the GMM which, among all of the set of GMMs in step 304, yields the highest output value, p(xn|λ) (also denoted λ(xn)), or highest logarithm of the output value, log(λ(xn)).

[0067] The logarithm of the output value for a GMM λ may be denoted log(p(xn|λ)) or log(λ(xn)), and may be calculated using Equation 2, which is also reproduced below:

log(p(xn|λ)) = log( Σ(k=1 to K) wk Φk(xn) )

Equation (2)

[0068] In the equation above, Σ denotes a summation operation (and not the covariance matrix) over the K components Φ, each of which consists of a parameter pair and is associated with a weight w. If the log likelihood of xn, log(p(xn|λ)), for a particular GMM is greater than the log likelihood of xn for all other GMMs in the set of GMMs obtained in step 304, the feature vector xn is assigned to that GMM.
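A non-limiting sketch of this assignment rule follows, using 1-D feature vectors for brevity; all model values are assumptions made for the example:

```python
import math

# Illustrative sketch of steps [0066]-[0068]: each feature vector is
# assigned to the single GMM yielding the highest log p(x | lambda)
# under Equation 2.

def log_gmm(x, weights, means, variances):
    total = sum(w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances))
    return math.log(total)

def assign(xs, gmms):
    """Index of the GMM with the highest log-likelihood, per vector."""
    return [max(range(len(gmms)), key=lambda j: log_gmm(x, *gmms[j]))
            for x in xs]

gmm_a = ([0.5, 0.5], [0.0, 1.0], [1.0, 1.0])   # plays the role of lambda_A
gmm_b = ([0.5, 0.5], [5.0, 6.0], [1.0, 1.0])   # plays the role of lambda_B
labels = assign([0.2, 0.8, 5.5, 0.1], [gmm_a, gmm_b])
```

With these toy models, three of the four vectors fall to the first GMM and one to the second, mirroring the T1 = 3, T2 = 1 split of the worked example below.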

[0069] In step 406, the module determines T1, wherein T1 is the total number of feature vectors assigned to the first GMM (e.g., 3 assigned feature vectors out of a total of 4 feature vectors generated from Y, or 3000 assigned feature vectors out of 4000 feature vectors generated from Y). T1 may be referred to as a cardinality of the set of feature vectors assigned to the first GMM.

[0070] In step 408, the module determines T2, wherein T2 is the total number of feature vectors assigned to the second GMM (e.g., 1 assigned feature vector out of a total of 4 feature vectors generated from Y, or 1000 assigned feature vectors out of 4000 feature vectors generated from Y). T2 may be referred to as a cardinality of the set of feature vectors assigned to the second GMM.

[0071] In step 410, the module determines a value N1 based on T1 and n. In some implementations, N1 = K1 * T1/n, rounded to the nearest whole number, where K1 is the total number of parameter pairs in the first GMM λA (e.g., N1 = 4 * 3/4 = 4 * 75% = 3). While the value of K1 is 4 in this example, in some applications it may be on the order of 1000 or 10,000, or some other number.

[0072] In step 412, the module determines a value N2 based on T2 and n. In some implementations, N2 = K2 * T2/n, rounded to the nearest whole number, where K2 is the total number of parameter pairs in the second GMM λB (e.g., N2 = 4 * 1/4 = 4 * 25% = 1).

[0073] In step 414, the module selects N1 parameter pairs from the first set of parameter pairs. This selection is illustrated in more detail in FIG. 5.

[0074] In step 416, the module selects N2 parameter pairs from the second set of parameter pairs. The GMM λUBM may comprise the N1 parameter pairs from the first set of parameter pairs and the N2 parameter pairs from the second set of parameter pairs (e.g., 3 parameter pairs from λA and 1 parameter pair from λB in one example, or 3000 parameter pairs from λA and 1000 parameter pairs from λB in another example). In some cases, K1 = K2 = K. If the UBM is formed from only parameter pairs of the first GMM and the second GMM, then N1 + N2 = K in some cases. If the UBM is formed from additional components, such as N3 parameter pairs from a third GMM, etc., then the sum of N1, N2, N3, etc. may be equal to K.

[0075] FIG. 5 illustrates example details of the selection of the N1 parameter pairs in step 414. The example includes step 502, in which, for each parameter pair included in the first set of parameter pairs, the module determines an accumulated posterior probability, as illustrated in more detail in FIG. 6.

[0076] In step 504, the module selects from the first set of parameter pairs the N1 parameter pairs having the highest accumulated posterior probability. The selection of the N2 parameter pairs in step 416 may be implemented in a similar manner.

[0077] FIG. 6 illustrates example steps for the determination of an accumulated posterior probability for a parameter pair. In step 602, for each vector x1, ..., xT1 assigned to the first GMM, the module may calculate a posterior probability for the parameter pair. This calculation may use Equation 3 above. Step 602 may be performed for each parameter pair included in the first set of parameter pairs.

[0078] In step 604, the module may sum the calculated posterior probabilities. This yields the accumulated posterior probability for a particular parameter pair. Step 604 may be performed for each parameter pair included in the first set of parameter pairs.
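The two steps above can be sketched as follows, in a non-limiting way; the standard GMM responsibility wk Φk(x) / Σj wj Φj(x) is used here as a stand-in for the posterior probability of Equation 3, and all 1-D model values are assumptions made for the example:

```python
import math

# Illustrative sketch of FIG. 6 / steps 602-604: for every vector assigned
# to a GMM, compute the posterior probability of each of its components,
# then sum per component.

def posteriors(x, weights, means, variances):
    vals = [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances)]
    total = sum(vals)
    return [v / total for v in vals]

def accumulated_posteriors(assigned_xs, weights, means, variances):
    acc = [0.0] * len(weights)
    for x in assigned_xs:                                       # step 602
        for k, p in enumerate(posteriors(x, weights, means, variances)):
            acc[k] += p                                         # step 604
    return acc

acc = accumulated_posteriors([0.1, 0.9, -0.3], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0])
```

Since the posteriors of each vector sum to 1, the accumulated values across all components sum to the number of assigned vectors (here 3), consistent with the row sums of Tables 1a and 1b below.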

[0079] As an example of the above steps, four feature vectors x1, ..., x4 may be generated from utterance U, and three of the feature vectors (x1, x2, x3) may be assigned to the first GMM λA and one of the feature vectors, x4, may be assigned to the second GMM λB. Thus, in this example n=4, T1=3, T2=1. Further, both the first and the second GMM may each have four parameter pairs, such that K1=K2=4. Step 410 may then determine N1 as K1*T1/n = 3, and step 412 may determine N2 as K2*T2/n = 1.

[0080] In step 414, the selection of N1 (e.g., 3) parameter pairs from a first set of parameter pairs in the first GMM λA can include step 502, where, for each parameter pair included in the first set, an accumulated posterior probability is determined. This can include step 602, where, for each of the vectors x1, x2, x3, a posterior probability may be calculated for each of the parameter pairs that make up the GMM λA. This is illustrated in Table 1a below.

Posterior probabilities of each of the components:

feature vectors   ΦA,1   ΦA,2   ΦA,3   ΦA,4
x1                0.5    0.1    0.3    0.1
x2                0.3    0.4    0.2    0.1
x3                0.4    0.3    0.1    0.2
Accumulated       1.2    0.8    0.6    0.4
posterior
probability

Table 1a: Selection of 75% of components in λA with the highest accumulated posterior probability. The first three parameter pairs are selected and will be pooled into the new λUBM.

[0081] Referring to the example in Table 1a above, for x1 the posterior probability for the parameter pair of ΦA,1 is calculated as 0.5. For x2 the posterior probability for the parameter pair of ΦA,1 is calculated as 0.3. For x3 the posterior probability for the parameter pair of ΦA,1 is calculated as 0.4. Step 604 sums these values, to yield a value of 1.2. This is the accumulated posterior probability for ΦA,1. Steps 602 and 604 can be repeated in a similar manner to yield the accumulated posterior probability values of 0.8, 0.6, and 0.4 for ΦA,2, ΦA,3, and ΦA,4, respectively.

[0082] The selection in step 414 may further include step 504, which selects from the first set of parameter pairs (ΦA,1, ΦA,2, ΦA,3, ΦA,4) the N1 (3 in this example) parameter pairs having the highest accumulated posterior probability. As shown above, the parameter pairs corresponding to ΦA,1, ΦA,2, and ΦA,3 are the three parameter pairs with the highest accumulated posterior probability values (of 1.2, 0.8 and 0.6), and are selected in step 504.
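Using the worked numbers of Table 1a, step 504 can be sketched as follows in a non-limiting way; the function name is an assumption made for the example:

```python
# Illustrative sketch of step 504: pick the N1 components with the highest
# accumulated posterior probability.

def select_top(acc, n_select):
    """Indices of the n_select highest accumulated posterior probabilities."""
    return sorted(range(len(acc)), key=lambda k: acc[k], reverse=True)[:n_select]

chosen_a = select_top([1.2, 0.8, 0.6, 0.4], 3)   # Table 1a, N1 = 3
chosen_b = select_top([0.1, 0.5, 0.2, 0.2], 1)   # Table 1b, N2 = 1
```

For Table 1a this picks the components at indices 0, 1, and 2 (ΦA,1, ΦA,2, ΦA,3); for Table 1b it picks index 1 (ΦB,2).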

[0083] Similarly, the calculation of the accumulated posterior probability for a parameter pair for ΦB,1 in the second GMM involves, for each feature vector x4 assigned to the second GMM, calculating a posterior probability using the feature vector and said parameter pair, and summing said calculated posterior probabilities. In this case, there is only one posterior probability value (0.1) in the sum. The accumulated posterior probability values for ΦB,2, ΦB,3, and ΦB,4 may be calculated in a similar manner as 0.5, 0.2, and 0.2, respectively. This is illustrated in Table 1b:

Posterior probabilities of each of the components:

feature vectors   ΦB,1   ΦB,2   ΦB,3   ΦB,4
x4                0.1    0.5    0.2    0.2
Accumulated       0.1    0.5    0.2    0.2
posterior
probability

Table 1b: Selection of 25% of components in λB with the highest accumulated posterior probability. The second parameter pair is selected and will be pooled into the new λUBM.

[0084] Referring back to step 416, N2 parameter pairs (1 parameter pair in this example) may be selected from the second set of parameter pairs (from the parameter pairs for ΦB,1, ΦB,2, ΦB,3, and ΦB,4). In an embodiment, the N2 parameter pairs with the highest accumulated posterior probability may be selected. In this example, this corresponds to the parameter pair for ΦB,2, which has an accumulated posterior probability value of 0.5.

[0085] In the example above, the new GMM λUBM used for the UBM will consist of the parameter pairs for the components ΦA,1, ΦA,2, ΦA,3, and ΦB,2. Each of the parameter pairs may be associated with respective weights wUBM,1, wUBM,2, wUBM,3, wUBM,4. In one example, this may involve dividing the weights that the selected parameter pairs have in the pre-existing GMMs (wA,1, wA,2, wA,3, wB,2) by a normalization factor, such that they sum to 1 (e.g., dividing them by the sum wA,1 + wA,2 + wA,3 + wB,2). In another example, the weights may be divided by a normalization factor equal to the sum of the posterior probability values corresponding to the selected parameter pairs. For instance, the posterior probability values corresponding to ΦA,1, ΦA,2, ΦA,3 may equal 0.5, 0.1, and 0.3, respectively, and the posterior probability value corresponding to ΦB,2 may equal 0.5. These values sum to 1.4. The weights wA,1, wA,2, wA,3, wB,2 may then be normalized by dividing them by the sum of the posterior probability values (1.4).
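The second normalization variant can be sketched as follows in a non-limiting way; the weight values below are assumptions made for the example (here chosen equal to the posterior values so the result happens to sum to 1):

```python
# Illustrative sketch of the second normalization of paragraph [0085]: the
# selected weights are each divided by the sum of the posterior probability
# values of the selected components (0.5 + 0.1 + 0.3 + 0.5 = 1.4 in the
# worked example).

def normalize_by_posterior_mass(weights, posterior_values):
    factor = sum(posterior_values)     # 1.4 in the example above
    return [w / factor for w in weights]

w_ubm = normalize_by_posterior_mass([0.5, 0.1, 0.3, 0.5],
                                    [0.5, 0.1, 0.3, 0.5])
```

Each selected weight is divided by the same factor of 1.4, so the first normalized weight is 0.5/1.4 ≈ 0.357.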

[0086] Referring back to step 312, the module may calculate log-likelihood values for the GMM λS and the GMM λUBM. The log-likelihood value for λS may be calculated, e.g., as:

log(p(X|λS)) = Σ(n=1..N) log( Σ(k=1..K) wS,k ΦS,k(xn) )

[0087] The log-likelihood value for λUBM may be calculated, e.g., as:

log(p(X|λUBM)) = Σ(n=1..N) log( Σ(k=1..K) wUBM,k ΦUBM,k(xn) )

[0088] In the above example, ΦUBM,k may be equal to ΦA,k for k = 1 through 3 (representing the selected parameter pairs from λA), and ΦUBM,4 may be equal to ΦB,2 (representing the selected parameter pair from λB). If log(p(X|λS)) minus log(p(X|λUBM)) is greater than a predetermined threshold Θ, the module may indicate that the user 101 is verified as the hypothesized speaker S. Otherwise, it may indicate that the user 101 is not verified as the hypothesized speaker S.
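The log-likelihood ratio test of paragraphs [0086]-[0088] can be sketched as below. This is an illustrative implementation only: it assumes diagonal-covariance Gaussian components Φk and a simple tuple representation (weights, means, variances) for each model, neither of which is prescribed by the application.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """log p(X|λ) = Σn log( Σk w_k Φk(x_n) ), with Φk taken to be
    diagonal-covariance Gaussian densities (an assumption of this sketch)."""
    ll = 0.0
    for x in X:
        mix = 0.0
        for w, m, v in zip(weights, means, variances):
            # density of a diagonal-covariance Gaussian at x
            pdf = np.exp(-0.5 * np.sum((x - m) ** 2 / v)) / np.sqrt(np.prod(2.0 * np.pi * v))
            mix += w * pdf
        ll += np.log(mix)
    return ll

def verify_speaker(X, lam_s, lam_ubm, theta):
    """Indicate that U was spoken by S when the log-likelihood ratio
    log p(X|λS) - log p(X|λUBM) exceeds the threshold Θ."""
    return gmm_log_likelihood(X, *lam_s) - gmm_log_likelihood(X, *lam_ubm) > theta

# Toy usage: a single feature vector at the speaker model's mean verifies
# against a background model centered elsewhere.
X = np.array([[0.0, 0.0]])
lam_s = ([1.0], np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]]))
lam_ubm = ([1.0], np.array([[3.0, 3.0]]), np.array([[1.0, 1.0]]))
```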

[0089] Exemplary Client Device

[0090] FIG. 7 illustrates a block diagram of an example of the client device 102. As shown in FIG. 7, client device 102 may include: a microphone 104 (e.g., a carbon button microphone or electret microphone) adapted to capture a segment of speech from a user; a computer system (CS) 702, which may include one or more processors 755 (e.g., a general purpose microprocessor and/or one or more other data processing circuits, such as an application specific integrated circuit (ASIC), field-programmable gate arrays (FPGAs), and the like); a network interface 705 (including, e.g., a transceiver) for use in connecting the device to a network (e.g., PLMN) and communicating with other units connected to the network; an antenna 722; and a data storage system 706 for storing information (e.g., network slice information received from a network management node (e.g., NM or DM)), which may include one or more non-volatile storage devices and/or one or more volatile storage devices (e.g., random access memory (RAM)). In embodiments where computer system 702 includes a general purpose microprocessor, a computer program product (CPP) 741 may be provided. CPP 741 includes a non-transitory computer readable medium (CRM) 742 storing a computer program (CP) 743 comprising computer readable instructions (CRI) 744. CRM 742 may be a non-transitory computer readable medium (e.g., magnetic media (e.g., a hard disk), optical media (e.g., a DVD), flash memory, and the like). In some embodiments, the CRI 744 of computer program 743 is configured such that, when executed by data processing system 702, the CRI causes the computer system to perform steps described herein. In other embodiments, computer system 702 may consist merely of one or more ASICs. Hence, the features of the embodiments described herein may be implemented in hardware and/or software.
For example, in some embodiments, the functional components of client device 102 described above may be implemented by data processing system 702 executing program code 743, by data processing system 702 operating independent of any computer program code 743, or by any suitable combination of hardware and/or software.

[0091] In a second embodiment, client device 102 further includes: 1) a display screen coupled to the data processing system 702 that enables the data processing system 702 to display information to a user of client device 102; 2) a speaker coupled to the data processing system 702 that enables the data processing system 702 to output audio to the user of device 102; and 3) a microphone 104 coupled to the data processing system 702 that enables the data processing system 702 to receive audio from the user.

[0092] Exemplary Server

[0093] FIG. 8 is a block diagram of an embodiment of server 108. As shown in FIG. 8, server 108 may include: a computer system (CS) 802, which may include one or more processors 855 (e.g., a general purpose microprocessor and/or one or more other data processing circuits, such as an application specific integrated circuit (ASIC), field-programmable gate arrays (FPGAs), and the like); a network interface 805 for use in connecting the network node to a network (e.g., a core network) and communicating with other units connected to the network; a transceiver 807 coupled to an antenna 808 for wirelessly communicating with WCDs; and a data storage system 806 for storing information (e.g., network slice information received from a network management node (e.g., NM or DM)), which may include one or more non-volatile storage devices and/or one or more volatile storage devices (e.g., random access memory (RAM)). In embodiments where computer system 802 includes a general purpose microprocessor, a computer program product (CPP) 841 may be provided. CPP 841 includes a non-transitory computer readable medium (CRM) 842 storing a computer program (CP) 843 comprising computer readable instructions (CRI) 844. CRM 842 may be a non-transitory computer readable medium (e.g., magnetic media (e.g., a hard disk), optical media (e.g., a DVD), flash memory, and the like). In some embodiments, the CRI 844 of computer program 843 is configured such that, when executed by data processing system 802, the CRI causes the computer system to perform steps described herein. In other embodiments, computer system 802 may consist merely of one or more ASICs. Hence, the features of the embodiments described herein may be implemented in hardware and/or software.

[0094] In an embodiment, the system 100 includes an apparatus 900 for determining whether an utterance, U, was spoken by a hypothesized speaker, S, where the apparatus 900 comprises: means for obtaining (902) data corresponding to an utterance, U, captured by a microphone; means for obtaining (904) a set of two or more Gaussian Mixture Models (GMMs), wherein said set of GMMs comprises a first GMM, λGMM1, comprising a first set of parameter pairs, each parameter pair included in the first set of parameter pairs being associated with a weight and consisting of a mean vector and a corresponding covariance matrix, and wherein said set of GMMs further comprises a second GMM, λGMM2, comprising a second set of parameter pairs, each parameter pair included in the second set of parameter pairs being associated with a weight and consisting of a mean vector and a corresponding covariance matrix; means for selecting (906), based on the obtained data corresponding to the captured utterance, U, i) a subset of the first set of parameter pairs, thereby forming a first subset of parameter pairs, and ii) a subset of the second set of parameter pairs, thereby forming a second subset of parameter pairs; means for combining (908) the first subset of parameter pairs with the second subset of parameter pairs to form a Universal Background Model (UBM), λUBM, such that the UBM, λUBM, comprises the first subset of parameter pairs and the second subset of parameter pairs; means for obtaining (910) a GMM model, λS, associated with the hypothesized speaker, S; and means for using (912) λS, λUBM, and the data corresponding to the captured utterance U to determine whether U was spoken by S.
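The combining means (908) can be sketched as a pooling step over the selected parameter pairs. The dict layout and function name below are hypothetical conveniences for illustration; the weight renormalization follows the first variant of paragraph [0085].

```python
import numpy as np

def form_ubm(gmm_a, gmm_b, selected_a, selected_b):
    """Pool the selected parameter pairs (weight, mean vector, covariance)
    of two pre-existing GMMs into a single UBM, λUBM, renormalizing the
    pooled weights so they sum to 1."""
    weights = np.concatenate([gmm_a["weights"][selected_a],
                              gmm_b["weights"][selected_b]])
    return {
        "weights": weights / weights.sum(),
        "means": np.concatenate([gmm_a["means"][selected_a],
                                 gmm_b["means"][selected_b]]),
        "covs": np.concatenate([gmm_a["covs"][selected_a],
                                gmm_b["covs"][selected_b]]),
    }

# Example: keep ΦA,1..ΦA,3 from λA and ΦB,2 from λB, as in paragraph [0085].
gmm_a = {"weights": np.full(4, 0.25), "means": np.zeros((4, 2)), "covs": np.ones((4, 2))}
gmm_b = {"weights": np.full(4, 0.25), "means": np.ones((4, 2)), "covs": np.ones((4, 2))}
ubm = form_ubm(gmm_a, gmm_b, [0, 1, 2], [1])
```

The resulting UBM has four components whose weights sum to 1, ready for the log-likelihood ratio test of step 312.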

[0095] While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments.

[0096] Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, and the order of the steps may be re-arranged.