Title:
SPEECH ANALYTICS SYSTEM AND METHODOLOGY WITH ACCURATE STATISTICS
Document Type and Number:
WIPO Patent Application WO/2014/107141
Kind Code:
A1
Abstract:
The present invention relates to implementing new ways of automatically and robustly evaluating agent performance, customer satisfaction, campaign and competitor analysis in a call-center, and it comprises an analysis consumer server, a call pre-processing module, a speech-to-text module, an emotion recognition module, a gender identification module and a fraud detection module.

Inventors:
ARSLAN MUSTAFA LEVENT (TR)
HAZNEDAROĞLU ALI (TR)
Application Number:
PCT/TR2013/000002
Publication Date:
July 10, 2014
Filing Date:
January 03, 2013
Assignee:
SESTEK SES VE ILETIŞIM BILGISAYAR TEKNOLOJILERI SANAYII VE TICARET ANONIM ŞIRKETI (TR)
International Classes:
H04M3/42; G06N20/00; G10L15/07; G10L15/20; G10L15/26; G10L17/26; G10L25/63; G10L25/84; G10L25/93; H04L12/18; H04M3/22; H04M3/51
Domestic Patent References:
WO2008042725A2 (2008-04-10)
Foreign References:
US20020194002A1 (2002-12-19)
US20110282661A1 (2011-11-17)
US20110010173A1 (2011-01-13)
US20040249650A1 (2004-12-09)
US20060233347A1 (2006-10-19)
JPH05204394A (1993-08-13)
US20080228478A1 (2008-09-18)
US20060277465A1 (2006-12-07)
Attorney, Agent or Firm:
ANKARA PATENT BUREAU LIMITED (Bestekar Sok. No: 10, Ankara, TR)
Claims:
CLAIMS

1. The invention is a speech analytics system (S) providing accurate statistics by use of

- a voice recording system (1) that records the agent/customer calls and writes them in its database (1.2);

- an analysis provider server (2) that adds the calls newly recorded by the said voice recording system (1) to the analysis queue, in a manner supporting one another;

- an analysis consumer server (3) which analyzes the calls in the queue and outputs the analysis results;

And it is characterized in that it comprises

- a call pre-processing module (3.1) which is a sub-module of the said analysis consumer server (3) that separates the call recorded in said voice recording system (1) into its agent/customer channels, and automatically segments these mono channels into voiced/unvoiced and background voice segments;

- a speech-to-text module (3.2) which is another sub-module of the analysis consumer server (3) that automatically transcribes and time-aligns the agent and customer speech segments;

- an emotion recognition module (3.3) which is another sub-module of the said analysis consumer server (3) that automatically classifies the agent/customer speech segments as angry/non-angry;

- a gender identification module (7) which is a module of the said analysis consumer server (3) that automatically classifies the customer's gender;

- a fraud detection module (8) which is a module of the analysis consumer server (3) that decides the fraud probability of the calls using the customer speech segments.

2. A speech analytics system (S) according to claim 1; characterized in that it comprises a post-processing module (9) which is a sub-module of the said analysis consumer server (3) that collects the outputs of the previous modules (3.2, 3.3, 7, 8) and merges them to get the final analysis results of the call.

3. A speech analytics system (S) according to claim 1; characterized in that it comprises a grid-based structure to utilize agent PCs as the said analysis consumer servers (3).

4. A speech analytics system (S) according to claim 1; characterized by a training method of the "agent/customer language models" which are used in the said speech-to-text module (3.2), comprising the following process steps;

- separating the call center conversations into agent and customer channels;

- manually transcribing agent and customer channels;

- calculating distinct n-gram statistics for agent and customer channels using the transcribed texts;

- building separate language model files for agent and customer channels from these statistics.

5. A speech analytics system (S) according to claim 1; characterized by a training method of the "acoustic model" which is used in the speech-to-text module (3.2), comprising the following process steps;

- tagging the background and incoherent speech parts while manually transcribing the call-center conversations;

- adding a new model called "speech-filler" to the acoustic model phone list;

- constructing an HMM for the speech-filler model using the speech parts tagged in the first step;

- building the acoustic model file with the addition of this filler model.

6. A speech analytics system (S) according to claim 1; characterized by a filtering method for filtering the background speech using energy variations in the said pre-processing module (3.1), comprising the following process steps;

- using the information from opposite channel in deciding about the validity of current channel's speech regions;

- analyzing the regions where the opposite party is not speaking using the previous voiced/unvoiced decisions;

- from those regions, calculating an estimated volume level for the current participant's speech;

- setting a certain ratio of that volume level as a threshold for the voice activity detection system;

- using this threshold to eliminate background speech by comparing it with the current segment's energy level.

7. A speech analytics system (S) according to claim 2; characterized by a robust method for calculating the agent's speech speed in the post-processing module (9) for agent performance analysis, by using the text and time-aligned outputs of the said speech-to-text module (3.2), comprising the following process steps;

- employing a text area where commonly used agent scripts and phrases with high recognition accuracies are entered;

- searching the speech-to-text output of the agent channel for these phrases and scripts;

- calculating the total duration of agent's speech from the time-aligned outputs of these phrases;

- dividing this duration by the total number of letters in the output text.

8. A speech analytics system (S) according to claim 2; characterized by a robust method of calculating interrupt counts in the post-processing module (9) for agent performance and customer satisfaction analysis, comprising the following process steps;

- employing a text area where the stop-words for interruption are entered (these words will be ignored in the following steps);

- finding the interrupts from the speech-to-text module's output text and its time-alignments;

- eliminating the interrupts having the stop-words;

- dividing the number of interrupts by the total speech duration to get the final result.

9. A speech analytics system (S) according to claim 2; characterized by a robust method of calculating the agent's voice monotonicity level in the said post-processing module (9), comprising the following steps;

- concentrating on the energy level as well and disregarding low-energy regions;

- calculating the voice monotonicity of speech parts that are not disregarded in the first step.

10. A speech analytics system (S) according to claim 2; characterized by a method for analyzing agent performance by a proposed metric called "Hesitation count per minute", comprising the following steps;

- finding the speech parts where the agent hesitates;

- normalizing the hesitation count by agent speech duration on the entire call by using the outputs of the said speech-to-text module (3.2);

- calculating the "agent hesitation count per minute" parameter in the post-processing module (9).

11. A speech analytics system (S) according to claim 2; characterized by a method of evaluating conversation effectiveness by a proposed metric called "Customer share", comprising the following steps;

- finding the speech durations of agent and customer channels by using the time-aligned outputs of the speech-to-text module (3.2);

- getting the ratio of these durations to find the customer share in the post-processing module (9).

12. A speech analytics system (S) according to claim 2; characterized by a method of evaluating conversation effectiveness by a proposed metric called "Maximum silence duration", comprising the following steps;

- finding the time slices where both agent and customer do not speak by using the time-aligned outputs of the speech-to-text module (3.2);

- finding the silent time slice that has the maximum duration in the post-processing module (9) and returning its duration.

13. A speech analytics system (S) according to claim 1; characterized by a method of evaluating conversation effectiveness by a proposed metric called "Anger locations", comprising the following steps;

- dividing the call into three portions as beginning, middle and end;

- labeling these portions if they contain anger by using the outputs of the emotion recognition module (3.3);

- returning the anger locations as three successive Boolean variables in the post-processing module (9), where 1 shows that anger was found in the portion.

14. A speech analytics system (S) according to claim 1; characterized by a method of detecting customer fraud situation probabilities by integrating a fraud detection module (8), comprising the following steps;

- selecting the first call in which the customer speaks for more than one minute as the enrollment record;

- separating the customer channel of the call to use and then trimming its silences;

- generating the customer's GMM-based voice-print using the selected and trimmed channel;

- evaluating a match score for every other call of the same customer in which he or she speaks for more than one minute using the fraud detection module (8);

- listing the lower-scored calls as fraud candidates.

15. A speech analytics system (S) according to claim 1; characterized by a method of detecting the customer's gender by integrating a gender identification module (7), comprising the following steps;

- training HMM-based gender models for male and female speakers;

- separating the customer channel of the call, and trimming the silent parts;

- using the gender identification module to identify the customer's gender.

16. A speech analytics system (S) according to claim 2; characterized in that it comprises a module of analyzing events in the call center, comprising the following steps;

- allowing the user to select two events to compare;

- retrieving the calls related to these events;

- using the retrieved calls' outputs of the said speech-to-text module (3.2) to find the unigram, bigram and trigram word statistics;

- showing the most frequent word groups for both events, together with the most differentiating word groups between these events;

- using the retrieved calls' outputs of the said post-processing module (9) to find the analysis metrics' averages for both events and the differences between these events, and showing them to the user;

- marking the most statistically significant analysis metrics by using the variance analysis results.

17. A speech analytics system (S) according to claim 2; characterized in that it comprises a module of showing the agent performance, comprising the following steps;

- calculating the minimum, maximum, and average of each analysis metric's outputs that are generated by the said post-processing module (9), for the selected agent and a certain time slice;

- showing the results as a blood-test table together with the corresponding time slice's call-center averages and the metrics' reference values.

Description:
DESCRIPTION

SPEECH ANALYTICS SYSTEM AND METHODOLOGY WITH ACCURATE STATISTICS

The Related Art

The present invention relates to implementing new ways of automatically and robustly evaluating agent performance, customer satisfaction, campaign and competitor analysis in a call-center.

The present invention especially relates to automatically analyzing the conversations between agents and customers to extract useful information from call-center recordings.

The Prior Art

Today, speech-to-text systems are commonly used for speech-mining in call-center applications, as they can give rich speech-to-text outputs that can be used for many different information retrieval purposes. However, the speech recognition performance of such systems degrades due to harsh conditions such as background speech and noise, the huge speaking-style variability of the speakers, and the high perplexity of largely varying request content. This degradation in speech-to-text performance may affect the subsequent analyses and decrease the reliability of the statistics inferred from the speech analytics system:

- Agent voice's monotonicity

- Agent speaking rate

- Agent/customer interrupt, block speaking counts

- Dialog-based analysis

Speech-to-text systems also demand powerful analysis servers, as the speech recognition module is highly CPU-intensive. This situation creates an extra need for hardware and increases the overall costs excessively.

In the prior art, speech analytics systems are gradually being improved, and alternatives are being created for producing accurate statistics in such systems. However, the accuracy rates of these alternatives are not much higher. In conclusion, improvements are being made in methods for speech analytics and for transcribing speech into text with minimum error; therefore, new embodiments eliminating the disadvantages touched on above and bringing solutions to existing systems are needed.

Purpose of the Invention

The present invention relates to a method meeting the above-mentioned requirements, eliminating all the disadvantages and introducing some additional advantages, and providing automatic and robust evaluation of agent performance, customer satisfaction, and campaign and competitor analysis in a call-center.

A purpose of the invention is to construct a grid-based analysis structure that utilizes the agent PCs. This grid-based structure eliminates the need for extra hardware and therefore reduces the cost of the overall system implementation.

A still another purpose of the invention is to improve the speech recognition performance of the speech analytics system by training separate language models for agent and customer channels. In call-center conversations, the agents' and customers' speech contexts differ, as agent speech is mostly script-based while customer speech is more free-context. So, agent- and customer-specific language models are trained separately, which decreases the language model perplexity for both channels and thus improves the speech recognition accuracies.

A still another purpose of the invention is to include a speech-filler model while training the acoustic models. This speech-filler model accounts for out-of-vocabulary words together with background speech, and it increases the speech recognition performance and the reliability of the speech-to-text module.

A still another purpose of the invention is to develop a method that filters the background speech using energy variations in the call. In call centers, there is a high level of background speech interfering with the agent's speech, and noise suppression systems cannot deal with this type of non-stationary noise effectively. By using the energy information of the opposite channel, the validity of the current channel's speech regions is decided, and background speech parts are filtered.

A still another purpose of the invention is to develop a robust method of calculating the agent's speech speed. The agent's speech speed calculation is heavily affected by the speech recognition performance, and in cases with low recognition accuracies this calculation may give wrong results. To overcome this problem, commonly used agent scripts, which have higher recognition accuracies, are used in speech speed calculations.

A further purpose of the invention is to develop a robust method of calculating interrupt counts. Interruptions during the conversation are good indications of participant behavior, and a stop-word filtering method is invented to get the correct counts.

A further purpose of the invention is to develop a robust method of calculating agent voice's monotonicity, which uses energy levels of the speech to filter out background speech.

A still another purpose of the invention is to develop new methods of analyzing agent performance. One of these methods is to calculate a metric called agent hesitation count per minute. The agent hesitation count shows the confidence and control of the agent over the conversation subject, and it is a good measure of effective communication. Using the speech-to-text output of the agent channel, the number of times that the agent hesitates is found, and it is then normalized by the agent's speech duration on the entire call to get this metric's result.

A further purpose of the invention is to develop new methods of analyzing conversation effectiveness. One of these methods is to calculate a new metric called "customer share", which shows the participation of the customer in the conversation. Customer share is calculated by comparing the agent and customer speech durations on the call. Another method to analyze conversation effectiveness is to calculate the maximum silence duration of the call, using the speech-to-text output of both channels to find the parts where no participant speaks. Another method to measure conversation effectiveness is to find the anger locations on both channels using the emotion recognition module.

A further purpose of the invention is to integrate a fraud-detection module into the speech analytics system, which shows the calls that are likely to be fraudulent using a voice verification engine. Also, a gender detection module is integrated to automatically find the customer's gender.

A still another purpose of the invention is to integrate new modules which allow the system user to examine the final results more efficiently. One of these modules is the statistical comparison module, which is used for analyzing events by comparing the text and other analysis output statistics. Another module is used for analyzing agent performance by showing the agent's performance as a blood-test table. This table shows the minimum, maximum, and average of each analysis parameter of an agent's calls from a certain time slice.

The structural and characteristic features of the invention and all of its advantages will be understood better from the detailed descriptions given below with reference to the figures; therefore, the assessment should be made taking the said figures and detailed explanations into account.

Brief Description of the Drawings

To understand the embodiment of the present invention and its advantages, with its additional components, in the best way, it should be evaluated together with the figures described below.

Figure 1. The speech analytics system, including the grid-based analysis structure, and the interaction of the elements required for implementation of the methods, illustrated schematically.

Figure 2. The sub-modules of the call analysis consumer servers (3), illustrated schematically as a flowchart.

Figure 3. The steps of the new method for filtering background speech by energy variation, shown as a flowchart.

Figure 4. The steps of the fraud-detection method realized by the fraud detection module (8), illustrated schematically.

Figure 5. A schematic view indicating the method steps of the statistical comparison component, which includes the speech-to-text module and the emotion recognition module.

Figure 6. A view of the agent/agent group statistics screen: a new GUI component to show the agent performance.

The drawings need not be drawn exactly to scale, and details not essential to understanding the present invention may have been omitted. Furthermore, elements that are identical, or at least substantially identical in function, are illustrated with the same number.

Reference Numbers

1. Voice Recording System

1.1 Recording Server

1.2 Database

2. Analysis Provider Server

3. Analysis Consumer Server

3.1. Call pre-processing Module

3.2. Speech-to-text Module

3.3. Emotion Recognition Module

7. Gender Identification Module

8. Fraud Detection Module

9. Post-processing Module

S: Speech Analytics System / System

Detailed Description of the Invention

In this detailed description, the preferred embodiments of the method constituting the subject of the invention are disclosed only for better understanding of the subject, and in a manner not constituting any restrictive effect.

The functions realized sequentially by the speech analytics system (S) being the subject of the invention are:

- Training the acoustic and language models that are used in the speech-to-text module (3.2)

- Training the emotion classification models that are used in the emotion recognition module (3.3)

- Training the gender identification models that are used in the gender identification module (7)

- Training the voice verification models that will be later used in the fraud detection module (8)

- Recording the agent-customer conversations in a call-center using the voice recording system (1)

- Adding the newly recorded calls from (1) to the analysis queue using the call analysis provider server (2)

- Picking the available call in the analysis queue and starting its analysis in the call analysis consumer (3)

- Separating the call into its customer and agent channels using the call- preprocessing module (3.1) of the analysis consumer (3)

- Automatically segmenting the agent/customer mono channels into voice/unvoiced and background speech parts using the call pre-processing module (3.1) of the analysis consumer (3)

- Automatically transcribing and time-aligning the agent/customer conversations using the speech-to-text module (3.2)

- Automatically classifying and labeling the agent/customer speeches as angry/normal using the emotion recognition module (3.3)

- Automatically identifying the customer's gender using the gender identification module (7)

- Detecting the fraud probability of the customer using the fraud-detection module (8)

- Post-processing the outputs of the previous modules (3.2, 3.3, 7, 8) using the post processing module (9)

As mentioned, the speech analytics system (S) automatically analyzes the conversations between agents and customers to extract useful information from call-center recordings. The extracted information can be used in evaluating agent performance, customer satisfaction, campaign and competitor analysis, etc. in a call-center. Manual extraction of this information by people is a very costly process, both time- and resource-wise, and only a very small ratio of all calls can be analyzed. By using the speech analytics system, all of the calls are automatically analyzed in a more affordable manner. In this work, new ways to improve the current speech analytics mechanism are implemented in order to get more reliable analysis results.

The speech analytics process starts with recording the calls using the voice recording system (1). This is a voice-over-IP based system that records the agent-customer calls in G.711, stereo format, and it contains a database together with physical storage. Briefly, it records the agent-customer calls over station IPs and writes them to its database and physical storage.

The calls which are suitable for speech analysis are then added to the analysis queue by the analysis provider server (2). This server gets the newly collected calls from the recording system's (1) database and queues the calls suitable for analysis in a separate table.

Then, the analysis consumer servers (3) analyze the calls in the queue using their sub-components (modules), namely the call pre-processing module (3.1), the speech-to-text module (3.2) and the emotion recognition module (3.3). The system (S) also includes a gender identification module (7) and a fraud detection module (8) to detect fraudulent customer calls. Then a post-processing module (9) merges the outputs of the previous modules to calculate the final analysis results.

The call pre-processing module (3.1), a sub-module of the analysis consumer server (3), first separates the stereo calls recorded by the voice recording system (1) into mono agent and customer channels. These channels are analyzed separately in the following modules, and their outputs are then merged together to get the final analysis results. After the channel separation, each channel is automatically segmented into voiced and unvoiced parts using an energy-based VAD module, which uses the energy variations in the call to decide which segments contain speech. Only voiced segments are then used in the following modules. This module also detects the background speech, which is one of our claims in the application (Claim 4 in the Claims section).

The speech-to-text module (3.2), another sub-module of the analysis consumer server (3), automatically transcribes the agent and customer speech segments into texts using pre-trained pattern recognition components, namely acoustic and language models, which are trained using corpora of manually transcribed speech. The speech-to-text module (3.2) outputs a time-aligned transcription of both channels, and these transcriptions are used in the post-processing module (9) to obtain analysis results like agent speech speed, agent/customer interrupt and block speaking counts, maximum silence duration in the call, customer share ratio, and agent hesitation count.

The emotion recognition module (3.3) classifies the agent and customer speech segments into angry and non-angry emotion states using predefined emotion recognition models, which are trained using manually labeled data of normal and angry voice segments. Outputs of this module are used in analysis results such as agent/customer anger durations, ratios, and levels. This module also outputs the agent's voice monotonicity, which shows the tone variation level of the agent's voice.

There are also two new modules that work on the analysis consumer server (3). The first one is the gender identification module (7), which identifies the gender of the customer using pre-trained pattern recognition models. The second one is a voice-verification-based customer fraud detection module (8), which decides the fraud probability of the calls using the customer speech segments. The claims for these modules will be explained in the following sections.

The outputs of the previous modules are then conducted to the post-processing module (9), which collects the outputs of the previous modules (3.2, 3.3, 7, and 8) and merges them to show the final analysis results of the call. These analysis results are then used in evaluating agent performance, customer satisfaction, conversation effectiveness, etc.

The speech analytics system's (S) analysis results are as follows (claims are listed at the end of this document):

- Speech-to-text outputs of agent/customer sides

- Anger duration/ratio of agent/customer sides

- Agent/customer anger locations (includes new method - Claim 11)

- Overlap duration during the conversation

- Agent/customer interrupt counts per minute (includes new method - Claim 6)

- Agent/customer block speaking durations

- Agent's speech speed (includes new method - Claim 5)

- Agent's voice monotonicity level (includes new method - Claim 7)

- Agent's hesitation count per minute (includes new method - Claim 8)

- Customer share (participation) (includes new method - Claim 9)

- Total silence duration/ratio

- Maximum silence duration (includes new method - Claim 10)

- Customer gender - (includes new method - Claim 13)

- Fraud probability of the call (includes new method - Claim 12)

These results then can be compared and analyzed using:

- Statistical comparison component (includes new component - Claim 1)

- Agent performance GUI (includes new component - Claim 15)

Training method of the "agent/customer language models" which are used in the Speech-to-text module (5):

An essential component of all speech-to-text systems is the "language model", which calculates the probability of generating word sequences in a language. Language modeling is a data-driven technique: the probability distributions of word sequences are estimated from training data, which is text. Traditional speech-to-text systems train language models from a single text and use this model in their ongoing analysis.

In this case, there is a stereo voice recording system (1) in the speech analytics system (S), which allows us to separate agent and customer speech from each other and therefore analyze them individually in the speech-to-text module (3.2). We train and afterwards use different Language Models (LMs) for speech-to-text transcription, since agent speech is mostly script-based and has a limited vocabulary. This allows us to obtain a higher level of accuracy in speech-to-text transcription. Customer speech covers a wider range of topics and vocabulary. Therefore, different LMs are trained and used for agents and customers in the speech-to-text module (3.2), and this new method increases the speech-to-text correctness of the system (S) and thus the analysis reliability.
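As an illustration only, the per-channel n-gram counting described above could be sketched as follows in Python; the function names and in-memory transcript lists are hypothetical stand-ins, and a real system would feed these statistics into a language-model toolkit rather than keep raw counters.

```python
# Minimal sketch: distinct n-gram statistics per channel (hypothetical layout).
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams over tokenized sentences of one channel."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def channel_statistics(transcripts):
    """transcripts: manually transcribed utterances of a single channel."""
    sentences = [line.lower().split() for line in transcripts]
    return {n: ngram_counts(sentences, n) for n in (1, 2, 3)}

# Separate statistics become separate language model files per channel.
agent_stats = channel_statistics(["how can i help you", "thank you for calling"])
customer_stats = channel_statistics(["i have a problem with my bill"])
```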

Training method of the "acoustic mode!" which are used in the Speech-to-text module (3.2): An "acoustic model" resembles the vocal characteristic of the speech units, and their relations with each other. Acoustic models are also data driven, and trained from manually-transcribed speech segments. In acoustic model training and usage, there is a speech-filler model that accounts for out of vocabulary (OOV) words and background speech in order to increase the speech recognition performance and the reliability of the speech-to-text module (3.2). Some previous systems also train additional background models like [JP5204394], but in their case acoustic models are word-based, opposed to our models which are phoneme based. Phoneme-based models that are used for normal and background speech represent the voice characteristics better and in more detail; hence they result in higher speech-to-text accuracies.

A grid-based structure to utilize agent PC's as the analysis consumer servers (3):

Traditionally, there are multiple powerful and costly analysis consumer servers (3) in a speech analytics system (S), dedicated to the speech-to-text module (3.2) and the emotion recognition module (3.3). In the system (S), a grid-based structure is applied to utilize agent PCs as analysis consumer servers (3). This eliminates the need for extra hardware and therefore reduces the cost of the overall system implementation, as the utilized agent PCs already exist in the call-center and no additional server investments are needed.

The method for filtering the background speech using energy variations in the pre-processing module (3.1):

In call centers, there is a high level of background speech interfering with the agent's speech, and noise suppression systems cannot deal with this type of non-stationary noise effectively. Since the speech recognition system cannot differentiate between the agent's speech and other speech in the background, this background speech is also transcribed. These wrong transcriptions decrease the accuracy of statistical analyses in a speech analytics system. In this system (S), we use the information from the opposite channel in deciding about the validity of the current channel's speech regions. Normally, it is not very likely that both parties will be talking at the same time throughout the whole conversation. Therefore, the system (S) analyzes the regions where the opposite party is not speaking, using the previous voiced/unvoiced decisions. From those regions, an estimated volume level is calculated for the current participant's speech. Then, a certain ratio of that volume level is set as a threshold for the voice activity detection system. Most of the background speech is eliminated from the system by using this threshold and regarding segments below this threshold as background speech. Some previous methods also use energy-based SNR threshold techniques to estimate the speech level, such as [US2008228478], but they use the whole sound recording. In this case, we use the opposite channel's information in deciding about the validity of speech regions, leading to a better estimation of the normal speech volume level.
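The threshold logic might be sketched as follows; the frame sizes, the median estimator and the 0.25 ratio are illustrative assumptions, not values taken from the patent.

```python
# Sketch: opposite-channel energy threshold for background speech filtering.
import numpy as np

def frame_energy(signal, frame_len=400, hop=160):
    """Short-time energy per frame of one mono channel."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    return np.array([float(np.mean(f ** 2)) for f in frames])

def foreground_mask(cur_energy, voiced_opposite, ratio=0.25):
    """Estimate the speaker's volume from frames where the opposite party is
    silent, then keep only frames above a ratio of that level."""
    solo = cur_energy[~voiced_opposite]      # current party speaking alone
    level = np.median(solo) if solo.size else np.median(cur_energy)
    return cur_energy >= ratio * level       # True = genuine foreground speech

rng = np.random.default_rng(1)
agent = frame_energy(rng.normal(size=16000))
customer = frame_energy(rng.normal(size=16000))
voiced_customer = customer > np.median(customer)   # prior voiced decisions
keep_agent_frames = foreground_mask(agent, voiced_customer)
```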

Robust method of calculating agent's speech speed in the post-processing module (9) for agent performance analysis:

Speed calculations are done by the post-processing module (9) using the text and time-aligned outputs of the speech-to-text module (3.2). To find the agent's speech speed, the system (S) calculates the total duration of the agent's speech from the time-aligned outputs and divides this duration by the total number of letters in the output text.

This method's correctness is strongly related to the speech-to-text module's (3.2) performance: as the speech-to-text module (3.2) makes more recognition errors, the reliability of this metric degrades significantly. Agents usually follow routine scripts in their speech. If those scripts are long (e.g. "Welcome to how can I help you"), then the false alarm rate of the speech-to-text outputs decreases significantly. So, if frequently used 3-4 word phrases are used as major indicators for agent speed computation, a much higher accuracy in agent speed computation can be achieved. Therefore, in the agent speed computation, a text area is employed where commonly used agent scripts and phrases with high recognition accuracies can be entered. Then, the speech-to-text output of the agent channel is searched for these phrases, and only these phrases' letter counts and durations are used in the speech speed calculations.
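A minimal sketch of this phrase-anchored speed computation, assuming word-level time alignments arrive as (word, start, end) triples; the data layout is hypothetical.

```python
# Sketch: speech speed from high-confidence script phrases only.
def speech_speed(aligned_words, script_phrases):
    """aligned_words: list of (word, start_sec, end_sec) from speech-to-text.
    script_phrases: reliable phrases entered by the user in the text area.
    Returns seconds per letter measured over the matched phrases only."""
    words = [w for w, _, _ in aligned_words]
    total_dur, total_letters = 0.0, 0
    for phrase in script_phrases:
        tokens = phrase.split()
        for i in range(len(words) - len(tokens) + 1):
            if words[i:i + len(tokens)] == tokens:
                total_dur += aligned_words[i + len(tokens) - 1][2] - aligned_words[i][1]
                total_letters += sum(len(t) for t in tokens)
    return total_dur / total_letters if total_letters else None

out = [("how", 0.0, 0.2), ("can", 0.2, 0.4), ("i", 0.4, 0.5),
       ("help", 0.5, 0.8), ("you", 0.8, 1.0)]
print(speech_speed(out, ["how can i help you"]))  # ~0.071 sec per letter
```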

A robust method of calculating interrupt counts in the post-processing module (9) for agent performance and customer satisfaction analysis:

Interruptions during the agent/customer conversations are good indicators of agent performance and customer satisfaction. Agents shouldn't interrupt the customers, and if a customer interrupts the agent many times during the conversation, it shows that the customer is angry or unsatisfied.

Interrupt counts are found from the speech-to-text module's (3.2) output text and its time-alignments. But there are some approval words like "yes" and "ok" that shouldn't be counted as interruptions. Other systems may get erroneous results by taking these kinds of words into account. In this method, the users are allowed to enter these kinds of "stop words" into the system, and these words are ignored if they occur in the speech-to-text outputs when calculating the interrupt counts. The filler words are also ignored in the system, and the interrupt counts are normalized by the other party's speech duration to produce more useful and robust results. This new normalized analysis metric is called "interrupt count per minute".
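For instance, the stop-word filtering and per-minute normalization might be realized like the sketch below; the stop-word set and span representation are illustrative assumptions.

```python
# Sketch of "interrupt count per minute" with user-entered stop words.
STOP_WORDS = {"yes", "ok", "yeah"}   # approval words, ignored as interrupts

def interrupts_per_minute(own_words, other_speech, total_other_minutes):
    """own_words: (word, start, end) alignments for one party.
    other_speech: list of (start, end) spans where the other party talks.
    An interrupt is a non-stop-word starting inside the other party's span."""
    count = 0
    for word, start, _ in own_words:
        if word in STOP_WORDS:
            continue
        if any(s <= start < e for s, e in other_speech):
            count += 1
    return count / total_other_minutes

cust = [("ok", 1.0, 1.2), ("but", 1.5, 1.7), ("wait", 1.7, 2.0)]
agent_spans = [(0.0, 3.0)]
print(interrupts_per_minute(cust, agent_spans, total_other_minutes=3.0 / 60))
```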

A robust method of calculating the agent's voice monotonicity level in the post-processing module (9):

A robust method is implemented for evaluating the agent's voice monotonicity level. The agent's voice monotonicity level is used in agent performance analysis, as agents shouldn't speak in a monotone manner with the customers.

In some cases, background speech may come from a speaker of a different gender, resulting in pitch with high variance and a lower monotonicity level. In order to solve this problem, we concentrate on the energy level as well and disregard low-energy regions in the statistical pitch calculations.
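A sketch of such energy-gated pitch statistics, assuming per-frame pitch and energy tracks are already available; pitch standard deviation is used here as a stand-in variability measure, which the patent does not specify.

```python
# Sketch: pitch variability computed only over sufficiently loud frames.
import numpy as np

def monotonicity(pitch, energy, energy_ratio=0.2):
    """pitch, energy: per-frame arrays for the agent channel.
    Low-energy frames (likely background speakers) are disregarded."""
    loud = energy >= energy_ratio * np.median(energy[energy > 0])
    voiced = (pitch > 0) & loud          # pitch of 0 marks unvoiced frames
    if not voiced.any():
        return None
    # A lower pitch standard deviation means a more monotone voice.
    return float(np.std(pitch[voiced]))

rng = np.random.default_rng(2)
pitch = np.abs(rng.normal(120, 15, 500))     # synthetic pitch track (Hz)
energy = np.abs(rng.normal(1.0, 0.3, 500))   # synthetic energy track
print(monotonicity(pitch, energy))
```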

The method for analyzing agent performance by a proposed metric called "Hesitation count per minute":

Using the outputs of the speech-to-text module (3.2), we find the speech parts where the agent hesitates and normalize the hesitation count by the agent's speech duration on the entire call. This new parameter, "agent hesitation count per minute", which is calculated in the post-processing module (9), shows the confidence and control of the agent over the conversation subject, and it is used as an additional performance criterion, as hesitations are a useful signal in measuring effective communication.
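Assuming hesitations surface as filler tokens in the speech-to-text output (the marker set below is a hypothetical placeholder), the metric reduces to a count normalized by the agent's speech time:

```python
# Sketch of "agent hesitation count per minute" (marker set assumed).
HESITATION_TOKENS = {"uh", "um", "eee", "hmm"}

def hesitations_per_minute(agent_words, agent_speech_seconds):
    """agent_words: recognized agent tokens over the whole call."""
    count = sum(1 for w in agent_words if w in HESITATION_TOKENS)
    return count / (agent_speech_seconds / 60.0)

print(hesitations_per_minute(["um", "yes", "uh", "sure"], agent_speech_seconds=120))
```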

The method of evaluating conversation effectiveness by a proposed metric called "Customer share":

The customer's participation in the conversation is a good measure for evaluating conversation effectiveness. A new metric called "customer share", which shows whether the conversation is more like a monologue or a dialogue, is proposed. This metric is calculated in the post-processing module (9) by dividing the customer's speech duration by the total speech duration of the agent and customer. These speech durations are extracted from the speech-to-text module's (3.2) outputs.
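As a worked example of the ratio (the durations below are invented):

```python
# "Customer share": customer speech time over total speech time.
def customer_share(customer_seconds, agent_seconds):
    return customer_seconds / (customer_seconds + agent_seconds)

print(customer_share(90.0, 210.0))  # 0.3 -> conversation closer to a monologue
```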

The method of evaluating conversation effectiveness by a proposed metric called "Maximum silence duration":

Long silences in the conversation decrease its effectiveness, increase the call cost for the call-center, bore the customer, and create suspicions about the agent's control of the subject. So these are good indicators of conversation effectiveness and agent performance. Accordingly, an analysis parameter called "maximum silence" is introduced. Said parameter is calculated from the speech-to-text module's (3.2) text outputs of the agent and customer channels and their time-alignments. The parts where neither of the participants speaks are labeled as silent segments in the post-processing module (9), and the segment that has the longest duration is shown to the user as the "maximum silence duration".
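One way to sketch the gap search over the merged speech spans of both channels (the span format is an assumption):

```python
# Sketch: longest gap where neither party's aligned segments contain speech.
def max_silence(spans, call_end):
    """spans: merged, sorted-or-unsorted (start, end) speech spans of both
    channels; call_end: total call duration in seconds."""
    spans = sorted(spans)
    longest, cursor = 0.0, 0.0
    for start, end in spans:
        longest = max(longest, start - cursor)   # gap before this span
        cursor = max(cursor, end)
    return max(longest, call_end - cursor)       # trailing silence

print(max_silence([(0.0, 5.0), (9.5, 12.0), (12.5, 20.0)], call_end=22.0))  # 4.5
```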

The method for evaluating emotion pattern of the conversation by a proposed metric called "Anger locations":

The anger content of the conversation is calculated by the automatic emotion recognition module (3.3). This module uses Support Vector Machines (SVMs) to train the classification models, and then classifies the agent/customer speech segments into angry/non-angry emotion states.

Finding and showing the anger durations or levels of the conversation gives us useful clues about the agent performance or customer satisfaction, but they are not sufficient for a complete analysis alone. For example, a customer who is only angry at the beginning of the conversation and one who is only angry at the end of the conversation may have equal anger durations, but these cases differ for agent performance analysis. The first case is an indicator of good performance, as the agent soothes the customer during the conversation, but the second case indicates the opposite. So, we propose a new metric called "anger location" to handle these kinds of situations. In calculating the anger durations, the entire call is divided into three portions (beginning, middle, and end), and these portions are labeled if they contain anger found by the emotion recognition module (3.3). The results of this metric are shown with three successive Boolean variables, where "1" shows that anger was found in the portion. For example, if the customer anger location result comes out to be "1-0-0", it shows that the customer is angry at the beginning of the conversation, but he/she calms down in the later parts.
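The portion labeling might look like the following sketch; the overlap rule for angry spans crossing portion boundaries is an assumption.

```python
# Sketch: three Boolean anger flags for beginning, middle and end portions.
def anger_locations(angry_segments, call_duration):
    """angry_segments: (start, end) spans labeled angry by emotion recognition."""
    third = call_duration / 3.0
    flags = [0, 0, 0]
    for start, end in angry_segments:
        for i in range(3):
            lo, hi = i * third, (i + 1) * third
            if start < hi and end > lo:      # span overlaps this portion
                flags[i] = 1
    return "-".join(map(str, flags))

print(anger_locations([(10.0, 25.0)], call_duration=300.0))  # "1-0-0"
```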

The method for detecting possible fraud situations by the fraud detection module (8):

A fraud detection module (8), in which a GMM-based pre-trained voice verification system is integrated into the speech analytics system on the analysis consumer servers (3), is implemented. For each customer, the first call in which the customer speaks for more than one minute is selected as the enrollment record, and a match score is evaluated for every other call of the same customer. The lower-scored calls are then listed as possible fraud calls.

A previous system [US2011010173] also implements a fraud-detection mechanism, but it relies only on delays and failures in the customer's answers, which are extracted from the speech-to-text output. In this system, a fraud detection module (8) is implemented; said fraud detection module (8) uses the customer's voice characteristics when deciding on the fraud situation. This module is also more reliable, as no voice characteristics are considered in the previous system, and using only the speech-to-text output may lead to false decisions.
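A hedged sketch of the enrollment-and-scoring flow, using scikit-learn's GaussianMixture as a stand-in voice-print model; feature extraction and any decision thresholds are outside the sketch, and the data below is synthetic.

```python
# Sketch: GMM voice-print enrollment and per-call match scoring.
import numpy as np
from sklearn.mixture import GaussianMixture

def enroll(enroll_features, n_components=8):
    """enroll_features: (n_frames, n_features) from the first call in which
    the customer speaks for more than a minute, silences trimmed."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    return gmm.fit(enroll_features)

def rank_fraud_candidates(voice_print, later_calls):
    """later_calls: {call_id: feature matrix}. Lower score = fraud candidate."""
    scores = {cid: voice_print.score(feats) for cid, feats in later_calls.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])   # ascending likelihood

rng = np.random.default_rng(3)
vp = enroll(rng.normal(size=(600, 13)))
calls = {"call_17": rng.normal(size=(400, 13)),
         "call_23": rng.normal(loc=2.0, size=(400, 13))}  # shifted = suspicious
print(rank_fraud_candidates(vp, calls))
```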

The customer's gender detection component:

An automatic gender identification module (7) that works on the analysis consumer servers (3) is employed. Said module (7) identifies the gender of the customer by using Hidden Markov Models (HMMs). Although the customer's gender is known at the time of the conversation, its automatic detection can be useful in later statistical analysis in the call-center.
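A minimal sketch of HMM-based gender scoring, again assuming hmmlearn and pre-extracted features from the trimmed customer channel; the model sizes are illustrative.

```python
# Sketch: choose the gender model with the higher likelihood.
import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumed library choice

def train_gender_model(feature_list):
    X = np.vstack(feature_list)
    lengths = [len(f) for f in feature_list]
    return GaussianHMM(n_components=3, covariance_type="diag", n_iter=20).fit(X, lengths)

def identify_gender(models, customer_features):
    """models: {"male": hmm, "female": hmm}; features from the trimmed channel."""
    return max(models, key=lambda g: models[g].score(customer_features))

rng = np.random.default_rng(4)
models = {"male": train_gender_model([rng.normal(loc=-1, size=(80, 13))]),
          "female": train_gender_model([rng.normal(loc=1, size=(80, 13))])}
print(identify_gender(models, rng.normal(loc=1, size=(60, 13))))
```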

The statistical comparison component for analyzing events in the call center:

A statistics comparison component is employed in the post-processing module (9) for analyzing two similar events. For example, it is assumed that agent A is more successful than agent B in selling a product. The statistics comparison component compares the speech-to-text module's (3.2) text outputs of agent A with agent B's and then lists the words or phrases whose relative frequencies differ most. The results are listed both graphically and numerically. The process starts with retrieving the calls that are filtered by the user's chosen queries within the corresponding time slices. Then, using the speech-to-text outputs of these queries, unigram, bigram and trigram word statistics are found in the post-processing module (9).
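The frequency-difference listing could be sketched as follows, here for unigrams only, with toy transcripts standing in for the retrieved calls' speech-to-text outputs.

```python
# Sketch: most differentiating word groups between two event call sets.
from collections import Counter

def relative_freqs(texts, n=1):
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    total = sum(counts.values()) or 1
    return {g: c / total for g, c in counts.items()}

def differentiating(texts_a, texts_b, n=1, top=5):
    fa, fb = relative_freqs(texts_a, n), relative_freqs(texts_b, n)
    diffs = {g: fa.get(g, 0.0) - fb.get(g, 0.0) for g in set(fa) | set(fb)}
    return sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]

event_a = ["i want to cancel my plan", "cancel the contract please"]
event_b = ["tell me about the new campaign", "the campaign sounds good"]
print(differentiating(event_a, event_b, n=1))
```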

The most frequent word groups are then shown for each query together with their differences. Some previous systems [US2006277465] also use co-frequency analysis on given text inputs when creating relation maps, but they use pre-existing texts like internet page contents. In this case, there are no texts at the beginning of the process, and the system uses texts constituted from the outputs of our speech-to-text module. The system (S) also uses the differences of the text frequencies rather than the similarities. The other analysis metrics' averages and differences are also calculated by the post-processing module (9), and the most statistically significant parameters are marked by using variance analysis (ANOVA).

A GUI component to show the agent performance:

The agent's performance results are shown as a blood-test table, which is shown in Figure 6. This GUI component shows the minimum, maximum, and average of each analysis parameter of an agent's calls from a certain time slice. It also shows the call-center averages and reference values, and if a parameter diverges from its reference value, it is labeled with stars corresponding to the divergence level. All of the analysis statistics are calculated in the post-processing module (9).
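A sketch of the per-metric summary behind such a table; the star-per-25%-divergence rule is an invented placeholder, since the patent does not define the divergence scale.

```python
# Sketch: per-metric min/max/average with star labels for divergence.
def metric_summary(values, reference, star_step=0.25):
    """values: one metric's outputs over an agent's calls in a time slice.
    Stars grow with the relative divergence of the average from the reference."""
    avg = sum(values) / len(values)
    divergence = abs(avg - reference) / reference if reference else 0.0
    stars = "*" * min(3, int(divergence / star_step))
    return {"min": min(values), "max": max(values), "avg": round(avg, 2),
            "reference": reference, "flag": stars}

# e.g. agent hesitation count per minute across calls vs. a reference of 1.0
print(metric_summary([0.5, 2.5, 3.0], reference=1.0))
```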