

Title:
SYSTEM FOR AUTOMATIC ASSESSMENT OF FLUENCY IN SPOKEN LANGUAGE AND A METHOD THEREOF
Document Type and Number:
WIPO Patent Application WO/2021/074721
Kind Code:
A2
Abstract:
The present invention proposes a system for acoustic analyses of recorded speech, carried out to rate a speaker's oral skills. In the disclosed system, the prosodic features of the speech signal are used to rate spoken language fluency. This is achieved differently in each of the two possible scenarios: (i) Text-dependent: where the speaker reads out presented text and (ii) Text-independent: where the speaker utters a monologue using words of her choice. The disclosed system comprises a user attributes extraction block (1) to accept input, a model attributes extraction block (2) to extract a corresponding set of model attributes and an attributes comparison block (3) for comparison and for obtaining a fluency report (9).

Inventors:
RAO PREETI (IN)
SABU KAMINI (IN)
NAYAK NAGESH SATISH (IN)
SHREEHARSHA BOKKAHALLI SATISH (IN)
Application Number:
PCT/IB2020/058978
Publication Date:
April 22, 2021
Filing Date:
September 25, 2020
Assignee:
INDIAN INST TECHNOLOGY BOMBAY (IN)
International Classes:
G10L13/06
Attorney, Agent or Firm:
MAJUMDAR, Subhatosh et al. (IN)
Claims:

1. A system for performing speaking fluency assessment with focused feedback on Comprehensibility, Confidence and Cadence comprising a trained Automatic Speech Recognition (ASR) unit and a prosody detection unit for obtaining a transcription of speech, wherein the speech is recorded using an input unit, wherein the said system comprises: a user attributes extraction block (1) configured to accept a user audio as input and to extract a set of user attributes; a model attributes extraction block (2) configured to receive model text and the corresponding information structure and to extract the corresponding set of model attributes; and an attributes comparison block (3) configured to perform comparison of the set of user attributes and the set of model attributes to obtain a fluency report (9).

2. The system as claimed in claim 1, wherein the set of model attributes (8) serves as a reference for comparison with the set of user attributes.

3. The system as claimed in claim 1, wherein the model attributes extraction block is configured to derive the context dependent information.

4. The system as claimed in claim 3, wherein the context dependent information comprises phrase boundaries for delineating meaningful word groups, prominent words expected to be stressed to convey focus or new information, and the intonation associated with the sentence mode.

5. The system as claimed in claim 1, wherein the input unit provides the user’s audio as speech recorded directly using a microphone.

6. The system as claimed in claim 5, wherein the microphone is adapted to convert the user’s voice into an audio signal stored in digital form.

7. The system as claimed in claim 1, wherein the user’s audio is from a previously stored audio source.

8. The system as claimed in claim 1, wherein the user input to the User Attributes Extraction Block comprises the user’s recording having a video signal to capture facial expressions and actions of the user.

9. The system as claimed in claim 1, wherein the audio is passed through a speech/silence detector (102).

10. The system as claimed in claim 9, wherein the speech/silence detector (102) is configured to identify contiguous regions of speech or silence in the audio signal.

11. The system as claimed in claim 9, wherein the audio is passed through a speech enhancement block (103) to suppress the background noise present in the signal.

12. The system as claimed in claim 11, wherein the audio signal enhancement is based on Generative Adversarial Network deep learning.

13. The system as claimed in claim 1, wherein the system comprises a Word level features and classifier block (104), wherein the said block is configured to accept cleaned up speech and silence/speech labels as input and to provide word level attributes related to prominence and phrase boundary type as output.

14. The system as claimed in claim 13, wherein the word level features and classifier block (104) is configured to extract prosody and silence duration prior to the word and post the word.

15. The system as claimed in claim 14, wherein normalization and statistical functions are performed on the extracted features.

16. The system as claimed in claim 1, wherein the attributes comparison block is configured to match the text hypothesis with aligned model text to detect the lexical miscues.

17. The system as claimed in claim 16, wherein the number of miscues obtained and the estimated speech duration are used to determine the Words Correct per Minute (WCPM) metric.

18. The system as claimed in claim 1, wherein the comparison block is configured to perform the Lexical Assessment, wherein the individual count of the different types of lexical miscues forms the basis for the Lexical Assessment.

19. The system as claimed in claim 1, wherein the prosodic assessment includes rating the user on the following attributes: pace, phrasing or intonation and word prominence or stress.

20. The system as claimed in claim 19, wherein pace is determined by comparing the typical values for speech rates of a particular speech style with the user’s speech rate to determine if the pace of the speaker is optimal.

21. The system as claimed in claim 19, wherein the phrasing and prominence are predicted by using the pre-computed word-level prosodic features.

22. The system as claimed in claim 21, wherein the pre-computed features are used to compare the stress detected for each word with the stress expected for each word to give the prominence score.

23. The system as claimed in claim 1, wherein the system determines if the sentences end correctly by using the information from the model attributes and the word level intonation attributes.

24. The system as claimed in claim 1, wherein the system scores the expressiveness of the user’s speech at three levels given by monotonous, sing-song style and proper expressive reading.

25. The system as claimed in claim 1, wherein the system comprises a model text listening and recording screen to present the model text to be read to the user.

26. The system as claimed in claim 1, wherein the ASR is configured to generate an assessment report card for a user.

27. The system as claimed in claim 26, wherein the assessment report card includes a fluency feedback component, a lexical feedback component and a prosody feedback component.

28. The system as claimed in claim 1, wherein the system comprises means for listening to the recorded audio.

29. The system as claimed in claim 1, wherein the system comprises means to slow down the playback of the audio.

30. The system as claimed in claim 1, wherein the ASR is further configured to assess the audio recording input and to generate a report for the text-independent assessment.

31. The system as claimed in claim 30, wherein the detected words are tagged by the ASR, followed by applying a part-of-speech (PoS) tagger to the input to obtain the model information structure.

32. The system as claimed in claim 31, wherein the reference phrases and prominent words are formed for comparison with the detected user attributes.

33. The system as claimed in claim 30, wherein the statistical summary of detected pauses and pitch variations of the input is compared with expected norms for a fluent speaker.

34. The system as claimed in claim 1, wherein the lexical and prosodic assessments are combined via a trained classifier for predicting the comprehensibility.

35. The system as claimed in claim 34, wherein the said trained classifier is obtained by deep learning wherein the deep learning is based on human expert judgments of comprehensibility.

36. The system as claimed in claim 1, wherein the voice quality features and prosody features of the input audio are combined via another trained classifier for predicting the confidence of the user.

37. The system as claimed in claim 36, wherein the said trained classifier is obtained by deep learning wherein the deep learning is based on human expert judgments of confidence.

Description:
SYSTEM FOR AUTOMATIC ASSESSMENT OF FLUENCY IN SPOKEN LANGUAGE AND A METHOD THEREOF

TECHNICAL FIELD OF THE INVENTION

The present subject matter described herein relates to an audio/voice processing system; particularly, it relates to a system and method for assessing speech, and more particularly to a system and method for assessing the prosody of speech data.

BACKGROUND OF THE INVENTION

A person’s speech in any language conveys meaning through its lexical content (what is said) and its prosody (how it is said). Prosody involves the manipulation of supra-segmental attributes such as stress, pauses, intonation, loudness and relative durations of the words. Akin to punctuation in written text, prosody enhances the structural aspects of spoken language, makes prominent the significant words and serves to bring out affect, all of which function to contribute to the comprehensibility of the message or text.

While there are several patents on speech assessment in general in terms of word pronunciation correctness, there are fewer that assess lexical and prosodic fluencies. Lexical fluency is evaluated based on detected hesitations, repetitions and filled/unfilled pauses. Prosodic fluency is typically assessed in terms of intonation pattern matching with a reference pattern such as that from a tutor’s utterance of the same text.

Reference is made to US 2011/0270605 A1. The system disclosed in this patent application assesses input speech based on a prosody constraint obtained from similar reference text stored in a database.

Again, reference is made to US20120245942 A1, which discloses computer-implemented systems and methods for evaluating prosodic features of speech, in which a speech sample is received, where the speech sample is associated with a script. The speech sample is aligned with the script. In the said document, an event recognition metric of the speech sample is extracted, and locations of prosodic events are detected in the speech sample based on the event recognition metric. The locations of the detected prosodic events are compared with locations of model prosodic events, where the locations of model prosodic events identify expected locations of prosodic events of a fluent, native speaker speaking the script. A prosodic event metric is calculated based on the comparison, and the speech sample is scored using a scoring model based upon the prosodic event metric.

Further, reference is made to US20080140401A1 that discloses a method and apparatus for providing feedback to the reader on their reading performance which includes generating a grammar from an input text, accepting a receiver’s utterance and providing a score based on accuracy in terms of correctly read words, pronunciation, fluency and expressiveness. The document focuses on identifying deviations from the grammar of the input text to detect mistakes like skipped/repeated words, directionality, structural restarts, jump aheads, and fillers.

Reference is made to US9947322B2 that discloses systems and methods for automated evaluation of human speech that may include a microphone coupled with a computing device comprising a microprocessor, a memory, and a display operatively coupled together. The microphone may be configured to receive an audible unconstrained speech utterance from a user whose proficiency in a language is being tested and provide a corresponding audio signal to the computing device. The microprocessor and memory may use the plurality of supra-segmental parameters to calculate a language proficiency rating for the user and display the language proficiency rating of the user on the display associated with the computing device.

Another reference is made to US20090258333A1 that describes a system to capture user audio and analyse the audio. The analysis involves identifying the spoken sentences with the corresponding phonemic structure and prosodic features using the fundamental frequency, duration and energy of a segment. Apart from this, the document describes identifying keywords with confidence in the spoken text. It also describes different feedback mechanisms for achieving certain goals in language learning, one of which is showing the user's pitch trajectory for tonal languages.

Similarly another reference is made to US20040006468A1 that describes a method and system for automatic pronunciation scoring in the context of language learning. The pronunciation score is derived from a combination of articulation, duration and intonation scoring engines. Each of these engines compares the users’ attributes to a reference attribute. The articulation scoring engine compares the phonemic representation, the duration engine compares the relative durations of the syllable, and the intonation engine compares the pitch contours.

Still further reference is made to US20080177545A1 that describes a system for automatic reading tutoring. It describes the construction of a domain-specific language model for the input text being read. It also describes the use of a garbage model for detection of different categories of miscues (word repetition, breath, partial word, pause, hesitation or elongation, wrong word, mispronunciation, background noise, interjection or insertion, non-speech sound, and hyperarticulation) typical of users in a reading tutoring application.

Yet again, reference is made to TBall (S. Narayanan, USC, 2011), LISTEN (Mostow, CMU, 2012) and FLORA (Bolanos et al., University of Colorado, Boulder, 2013), which represent the major past academic research that targeted automated assessment of children's reading. TBall dealt with kindergarten students reading isolated word lists. They used specially designed language models in ASR to detect lexical disfluency. However, lexical disfluency is not enough to rate oral reading fluency. The LISTEN group correlates the prosodic contours (pitch, energy, intensity and latency) of child speech with corresponding adult speech to rate the fluency of 7-10 year old children reading short stories. However, the adult speech may not always be available or usable, especially when the student makes a substitution or omission. The FLORA group grades overall literacy for a 1-minute paragraph reading by children of grades 1-4. They use lexical as well as prosodic features for giving an overall literacy score. Component-wise grading may, however, prove to be more beneficial in giving feedback to students.

For speaking assessment, apart from pronunciation, fluency, phrasing, and prominence, which have already been mentioned in the prior art, confidence, cadence and comprehensibility are additional attributes we are claiming to be non-obvious/inventive. In view of the prior art discussed herein, there remains a dire need of a system that can rate spoken language fluency in each of the two possible cases: (i) Text-dependent: where the speaker reads out presented text and (ii) Text-independent: where the speaker utters a monologue using words of his/her choice.

SUMMARY OF THE INVENTION

The following disclosure presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the present invention. It is not intended to identify the key/critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description of the invention presented later.

An object of the present invention is to overcome the drawbacks associated with the prior art.

Another object of the present invention is to provide a solution that can rate spoken language fluency in each of the two possible cases: (i) Text-dependent: where the speaker reads out presented text (ii) Text-independent: where the speaker utters a monologue using words of his/ her choice.

Yet another object of the present invention is to provide a solution for acoustic analyses of recorded speech carried out to rate a speaker’s oral skills.

Briefly, various aspects of the subject matter described herein are directed at a system for performing speaking fluency assessment comprising a trained Automatic Speech Recognition (ASR) unit for obtaining a transcription of speech, wherein the speech is recorded using an input unit, wherein the said system comprises: a user attributes extraction block configured to accept a user audio as input and to extract a set of user attributes; a model attributes extraction block configured to receive model text and the corresponding information structure and to extract the corresponding set of model attributes; and an attributes comparison block configured to perform comparison of the set of user attributes and the set of model attributes to obtain a fluency report.

Other salient features and advantages of the invention will become apparent to those skilled in the art from the following detailed description, which, taken in conjunction with the annexed drawings, discloses exemplary embodiments of the invention.

BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS

The above and other aspects, features, and advantages of certain exemplary embodiments of the present invention will be more apparent from the following description taken in conjunction with the accompanying drawings in which:

Figure 1 illustrates a flowchart of a process used by an implementation of a text independent system as per the prior art of the present invention.

Figure 2 illustrates the system for performing speaking fluency assessment in a text-dependent case according to the present invention.

Figure 3 illustrates the details of the Model Attributes Extraction block according to the present invention.

Figure 4 illustrates the details of the User Attributes Extraction block in which the user audio signal is the user’s speech which is recorded directly using a microphone according to the present invention.

Figure 5 illustrates the internal blocks of the word level features and classifier block according to the present invention. Figure 6 illustrates the attributes comparison block of the system according to the present invention.

Figure 7 illustrates the model text for a short paragraph as displayed on a tablet according to the present invention.

Figure 8 illustrates a sample assessment report card for a user according to the present invention.

Figure 9 illustrates the detailed lexical feedback according to the present invention.

Persons skilled in the art will appreciate that elements in the figures are illustrated for simplicity and clarity and may have not been drawn to scale. For example, the dimensions of some of the elements in the figure may be exaggerated relative to other elements to help to improve understanding of various exemplary embodiments of the present disclosure. Throughout the drawings, it should be noted that like reference numbers are used to depict the same or similar elements, features, and structures.

DETAILED DESCRIPTION OF THE PRESENT INVENTION

The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the invention. It includes various specific details to assist in that understanding but these are to be regarded as merely exemplary.

Accordingly, those skilled in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.

The terms and words used in the following description are not limited to the bibliographical meanings, but are merely used by the inventor to enable a clear and consistent understanding of the invention. Accordingly, it should be apparent to those skilled in the art that the following description of exemplary embodiments of the present invention is provided for illustration purposes only and not for the purpose of limiting the invention as defined by the appended claims and their equivalents.

It is to be understood that the singular forms “a”, “an” and “the” include plural referents unless the context clearly dictates otherwise.

By the term “substantially”, wherever used herein, it is meant that the recited characteristic, parameter, or value need not be achieved exactly, but that deviations or variations, including for example, tolerances, measurement error, measurement accuracy limitations and other factors known to those of skill in the art, may occur in amounts that do not preclude the effect the characteristic was intended to provide.

Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments.

It should be emphasized that the term “comprises/comprising” when used in this specification is taken to specify the presence of stated features, integers, steps or components but does not preclude the presence or addition of one or more other features, integers, steps, components or groups thereof.

In an embodiment of the present invention, the components for both text-dependent and text-independent speech are disclosed. The evaluated attributes are augmented with new high-level aspects such as confidence and presence of cadence possibly transferred from the native tongue. Figure 1 illustrates a flowchart of a process used by an implementation of a text independent system as per the prior art of the present invention.

In an embodiment of the present invention, an automatic assessment system that can rate spoken language fluency is disclosed. The said system is configured to rate spoken language fluency in each of the two possible scenarios: (i) Text-dependent: where the speaker reads out presented text (ii) Text-independent: where the speaker utters a monologue using words of his/ her choice.

In Case (i), the lexical and information structure of the known text are used to obtain a reference which can be used in a comparison with the speech parameters obtained by the acoustic analyses. The information structure comprises the locations of phrase boundaries and prominent words. Suitable prosodic parameter extraction and machine learning methods are proposed for the comparison of the reference and user data in order to obtain automatic ratings of oral skill that match human judgments.

In Case (ii), a state-of-the-art trained Automatic Speech Recognition (ASR) system is used to obtain a transcription of the recorded speech and identify parts of speech such as content and function words. Next, the phrase boundaries and prominences detected by the acoustic analysis are evaluated for appropriateness with respect to the obtained text transcription. The observed mismatches between expected and actual prosodic events are used to derive an estimate of the fluency of the passage.

The system is trained on a large dataset of recorded readings across speakers of various skill levels. The recordings have been rated by language experts for fluency using a scale ranging from Poor (or Struggling) to Fluent. Expert judgments on overall comprehensibility, perceived confidence and expressiveness are also obtained for each recording. The goal of the training is to determine the acoustic features and machine classifier that maximize the system’s performance in terms of achieved correlation with the corresponding human expert ratings.

The automatic fluency assessment system as described here can find practical application in literacy assessment, speaking skill evaluation and computer-aided language learning (CALL).

In an embodiment of the present invention is disclosed a system for automatic assessment, wherein the system comprises an input unit for providing audio inputs and a module for extraction of the lexical and the information structure of a known text, wherein the said information structure comprises the locations of phrase boundaries and prominent words; automatic speech recognition, prosodic parameter extraction and machine learning methods are applied for the comparison of the input audio with the model derived from the known text to obtain a result. Based on the obtained results, the system outputs automatic ratings of oral skill that match human judgments. The system further includes a trained Automatic Speech Recognition (ASR) unit for obtaining a transcription of speech, wherein the speech is recorded using a recording unit. The system identifies the different parts of speech. The system includes the evaluation of the phrase boundaries and prominences detected by the acoustic analysis with respect to the obtained text transcription. The system provides for comparing the expected and actual prosodic events such that, after comparing, mismatches are obtained to derive an estimate of the fluency of the input speech.

Figure 2 shows the system for performing speaking fluency assessment in a text-dependent case as per an embodiment of the present invention. Overall there are three main blocks in the system, namely, the user attributes extraction block (1), the model attributes extraction block (2) and the attributes comparison block (3). The user audio (4) serves as the input to the user attributes extraction block. The audio signal is analysed and the user attributes (7) necessary for fluency assessment are estimated. The model attributes extraction block receives model text (5) and its information structure (6). It further extracts the corresponding model attributes (8) which serve as a reference for comparison with the user attributes. The comparison of the two sets of attributes is then performed in the attributes comparison block to obtain a fluency report (9). The system is not restricted in its use to a particular language but can work equally well for any language with a written script.

Figure 3 shows the details of the Model Attributes Extraction block, according to an embodiment of the present invention. The model text (202) ideally contains a paragraph to be read by the user. It is necessary that the user be presented with a paragraph containing at least 3 or 4 sentences in order to reliably assess the user’s speaking skills. From the text corresponding to these sentences, the model attributes extraction block (203) derives the following context dependent information: phrase boundaries which are used to delineate meaningful word groups, prominent words which the user is expected to stress in order to convey focus or new information, and the intonation associated with the sentence mode (rising for a question, falling for a statement, etc.) as well as affect, if needed.

The punctuation provided in each sentence of the model text paragraph contains information related to the locations of pauses and breaks in the sentence as well as the intonation necessary for sentence endings. Other prosody related information not available from the model text is separately provided by the model information structure (201). This contains the locations of phrase breaks used for meaningful chunking of words as well as the words which need to be stressed and thus made prominent in the sentence, with their relative importance. For instance, in fast speech, the lower importance prominences may be omitted but not the highest ones. Additionally, it contains information related to the difficulty level of the text. This model information structure is annotated manually using a customized interface made available in a suitable form such as an XML file. The model text and information structure can also be automatically obtained using the audio recording of a model speaker.
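For illustration only, the kind of information carried by the model information structure could be captured in a simple data structure such as the one below; the field names, values and sentence are purely illustrative and do not represent the actual annotation schema of the disclosed interface.

# Illustrative (hypothetical) representation of the model information structure for one
# sentence of the model text; field names and values are examples only.
model_information_structure = {
    "sentence": "The hungry fox crept towards the quiet farmhouse.",
    "difficulty_level": "easy",
    "phrase_breaks_after_word": [3],  # a phrase break is expected after the fourth word
    "prominent_words": {"hungry": 2, "fox": 1, "farmhouse": 1},  # word -> relative importance (1 = highest)
    "sentence_mode": "statement",  # falling intonation expected at the sentence end
}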

The model text is also used for building the language model used by the automatic speech recognition (ASR) engine and for the word sequence needed for correct alignment of the acoustic signal to detect word miscues. Apart from the context dependent information, certain context independent attributes are also part of the model attributes block. These are attributes like speech rate which depend mainly on the speaking style. For example, for drama applications, slow speaking or fast speaking may be desired compared to conversational speech where the average speech rate is expected to be generally between 120-160 words per minute.

Figure 4 shows a User Attributes Extraction Block according to an embodiment of the present invention. With reference to figure 4, the user audio signal (101) is the user’s speech which is recorded directly using a microphone. The microphone converts the user’s voice into an audio signal which is stored in digital form. Alternatively, the user’s audio could be made available from a source having previously stored audio. Such alternative sources include a media player, audio streamed from the internet, or audio downloaded from an optical device, web server, or a storage device. Additionally, the user’s recording could also include a video signal to capture facial expressions and actions of the user, used as additional cues for the assessment of a user’s performance in a speech and drama context.

To remove extra silence regions, the audio is passed through a speech/silence detector (102). The speech/silence detector identifies contiguous regions of speech or silence in the audio signal to obtain what are henceforth referred to as silence/speech labels. This is followed by a speech enhancement block (103) which suppresses most of the background noise present in the signal. Generative Adversarial Network (GAN) based deep learning enhancement is preferred, as it does not distort the spectrum in the way spectral magnitude enhancement based methods do. The noise profiling is done in the silence regions for better suppression of noise in the speech regions. The speech enhancement stage leads to reliable feature extraction in the succeeding stages and also helps improve the ASR performance.
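A minimal sketch of the speech/silence detection step (102), based on short-time energy, is shown below for illustration; the frame sizes and threshold are assumptions, and the GAN-based enhancement block (103) is not reproduced here.

import numpy as np

def speech_silence_labels(audio, sr, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Label each analysis frame of a mono signal (numpy array) as speech (True) or
    silence (False) using short-time energy; a simplified stand-in for block (102)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    labels = []
    for start in range(0, max(len(audio) - frame, 1), hop):
        rms = np.sqrt(np.mean(audio[start:start + frame] ** 2) + 1e-12)
        labels.append(20 * np.log10(rms) > threshold_db)
    return labels  # contiguous runs of True/False give the speech/silence regions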

Figure 4 also depicts that the word-level labels are passed on to the word level features and classifier block (104). According to this embodiment of the invention, this block also takes the cleaned-up speech and silence/speech labels as input and outputs word level attributes related to prominence and phrase boundary type.

The ASR decoder (105) converts the user’s audio signal into a text transcript. As well-known to people skilled in the art, the ASR decoder needs an acoustic model and a language model which are part of the Trained Speech Models block. The acoustic model is trained using transcribed audio data of users belonging to the same demographic as the one using the application. For example, if the application is to be used for teaching children of a particular age group the acoustic model would be trained using transcribed audio data of children belonging to the same age group. Apart from age, a different acoustic model may need to be trained for different regions to account for local language influences. In case of lack of sufficient transcribed audio data, existing models may be adapted to cater to the new demographic. A dictionary corresponding to the language of the text is used as the pronunciation lexicon. A trigram language model is trained using the model text. All the words from canonical transcription along with the commonly expected substitutions, disfluency fillers and a phone sequence garbage model are used to build a text-specific language model.

The ASR decoder provides the text transcript for a particular user’s speech input. The transcript is accompanied by time-coded labels at different levels of granularity: start and end times of each word, syllable, and phoneme. The speech rate of the user is then calculated as the ratio of the number of words, or sub-words such as syllables or phones, to the reading duration. The reading duration may or may not include the detected silence regions, based on the application.
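The speech-rate computation described above amounts to dividing the number of recognized units by the reading duration; a small sketch is given below, with the word-timing format assumed as (word, start, end) tuples in seconds.

def speech_rate_wpm(word_timings, silence_regions=(), exclude_silence=False):
    """Words per minute from ASR word timings; optionally excludes detected silence
    regions from the reading duration, as noted in the description."""
    if not word_timings:
        return 0.0
    duration = word_timings[-1][2] - word_timings[0][1]
    if exclude_silence:
        duration -= sum(end - start for start, end in silence_regions)
    return 60.0 * len(word_timings) / max(duration, 1e-6)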

Figure 5 elaborates on the internal blocks of the word level features and classifier block according to an embodiment of the invention. Features computed to assess prosody, such as pitch, energy and spectral tilt, are calculated at the word level. Along with these, the silence duration prior to the word and post the word is also used. Appropriate normalization and context are considered during prosody feature extraction. This is done to minimize the speaker specific variability. Different statistical functionals such as mean, minimum, maximum, standard deviation and min-max range are calculated for each feature. All these word level features are utilized in the prosodic event detection classifiers. One classifier acts as a phrasal break detector while the other detects stressed words. Lexical stress assessment, to identify stress on each syllable of a word, can be implemented by following a similar procedure at the syllable level. The lexical and prosodic assessments are combined via a trained classifier for predicting the comprehensibility. The voice quality features and prosody features of the input are combined via another trained classifier for predicting the confidence of the user. The said trained classifiers are obtained by deep learning: the former is based on human expert judgments of comprehensibility, while the latter is based on human expert judgments of confidence. Using the pitch of the narration, it can be determined if the word has been uttered with a rising/falling/neutral intonation, which serves as a cue to the phrase boundary type (i.e. sentence final or intermediate). The same cue could also be used for rating the tonal aspects of languages such as Mandarin, among others. Spectral cues such as spectral envelope tilt are passed on to the next block to obtain fluency assessment related attributes as described in the next section. The estimated pitch and spectral tilt can also be used to represent affect.
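The statistical functionals listed above, applied to a frame-level contour (e.g. pitch or energy) within a word, could be computed as in the following sketch; the utterance-level normalization shown is one possible way to reduce speaker-specific variability and is only illustrative.

import numpy as np

def word_level_functionals(word_frames, utterance_frames):
    """Mean, min, max, standard deviation and min-max range of a word-level contour,
    z-normalized using utterance-level statistics to reduce speaker variability."""
    utt = np.asarray(utterance_frames, dtype=float)
    x = (np.asarray(word_frames, dtype=float) - utt.mean()) / (utt.std() + 1e-8)
    return {
        "mean": float(x.mean()),
        "min": float(x.min()),
        "max": float(x.max()),
        "std": float(x.std()),
        "range": float(x.max() - x.min()),
    }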

Figure 6 shows the attributes comparison block according to an embodiment of the invention. The text hypothesis is matched with the aligned model text to detect word level lexical miscues in the form of substitution, insertion, omission, or disfluency. The total number of miscues along with the estimated speech duration is used to determine the Words Correct Per Minute (WCPM) metric which, as the name suggests, is the average number of words correctly uttered by the user in a minute. This, along with the individual count of the different types of lexical miscues, forms the basis for the Lexical Assessment.
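The WCPM metric described above follows directly from the alignment counts; a minimal sketch, assuming the miscue counts have already been obtained from the text/hypothesis alignment, is given below.

def words_correct_per_minute(n_model_words, n_substitutions, n_omissions, speech_duration_s):
    """Words Correct Per Minute: words of the model text uttered correctly, per minute of speech."""
    correct = max(n_model_words - n_substitutions - n_omissions, 0)
    return 60.0 * correct / max(speech_duration_s, 1e-6)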

The Prosodic assessment involves rating the user on the following attributes: pace, phrasing or intonation and word prominence or stress. Pace is determined by comparing the typical values for speech rates of a particular speech style with the user’s speech rate to determine if the pace of the speaker is optimal. The consistency of the pace is also evaluated. The remaining two prosodic attributes viz. phrasing and prominence are predicted using the word-level prosodic features computed earlier.
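The pace rating described above reduces to comparing the measured speech rate against typical values for the speech style; the category boundaries in the sketch below are illustrative assumptions, not values fixed by the disclosure.

def rate_pace(user_wpm, style_range=(120, 160)):
    """Classify the user's pace relative to a typical range for the speech style
    (roughly 120-160 words per minute for conversational reading, as noted earlier)."""
    low, high = style_range
    if user_wpm < 0.7 * low:
        return "very slow"
    if user_wpm < low:
        return "slow"
    if user_wpm <= high:
        return "normal"
    if user_wpm <= 1.3 * high:
        return "fast"
    return "very fast"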

Locations of detected phrasal breaks are compared with the locations of grammatically expected phrasal breaks as conveyed by the information structure of the model text. If all the detected breaks perfectly match the expected ones, phrasing is rated at the highest level. On the other hand, if the phrasal breaks occur at almost every word, leading to word-by-word or list form reading, the lowest phrasing score can be given. Other intermediate levels will involve uneven pauses and awkward positions of breaks. Detected phrase breaks deemed correct are separately classified into phrase and sentence boundaries to compare with the model information structure. Prominence is indicated by the word stress. A list of expected stressed words is supplied beforehand based on whether they add new and important information in the given context. By comparing the detected stressed words with the expected ones and their relative importance, a prominence score is determined. The detected prominences along with the phrase intonation contribute to the expression score.
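The prominence scoring described above can be viewed as a weighted match between detected and expected stressed words; the sketch below illustrates this comparison, with the weighting scheme being an assumption rather than the disclosed formula.

def prominence_score(detected_stressed, expected_stressed):
    """Weighted fraction of expected prominent words that were actually stressed.
    detected_stressed: set of word indices detected as stressed;
    expected_stressed: dict mapping word index -> relative importance (1 = highest)."""
    if not expected_stressed:
        return 1.0
    weights = {idx: 1.0 / imp for idx, imp in expected_stressed.items()}
    matched = sum(w for idx, w in weights.items() if idx in detected_stressed)
    return matched / sum(weights.values())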

Using the information from the model attributes and word-level intonation attributes, it can also be deduced if the sentence endings were realized correctly. For an interrogative sentence, the last word in the sentence ends with a rising intonation, compared to a sentence containing a declarative statement where there is a falling intonation. Based on the number of sentence endings correctly realized, an appropriate rating may be provided to the user.
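Whether a sentence ending was realized correctly can be checked by comparing the pitch movement on the final word against the movement expected for the sentence mode; the linear-fit slope and threshold in this sketch are illustrative.

import numpy as np

def ending_realized_correctly(final_word_pitch, sentence_mode, slope_threshold=0.5):
    """True if the final word's pitch contour rises for a question or falls for a statement;
    the contour is a sequence of pitch values and the threshold is an assumed constant."""
    t = np.arange(len(final_word_pitch))
    slope = np.polyfit(t, np.asarray(final_word_pitch, dtype=float), 1)[0]
    if sentence_mode == "question":
        return slope > slope_threshold
    return slope < -slope_threshold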

Expressiveness indicates whether the speaker speaks in a flat monotonous voice or performs prosodic (e.g. pitch, energy) variations, in which case the variations may be positioned in a manner appropriate to the story context or, on the other hand, follow a fixed rhythm (or a sing-song cadence like chanting). It is well known that non-native learners of a spoken language impart their native tongue cadences to their speech in the target language. The pattern of stresses is therefore affected and makes the speech difficult to comprehend.

Expressiveness is to be scored at three levels, viz., monotonous, sing-song style, and proper expressive reading. Proper expressive reading refers to expression in line with the text, which can further be scored based on comparison with the canonical information structure to obtain an expressiveness rating. The expressiveness rating can be on the basis of pitch/energy/duration variation across sentences and words, which will indicate the monotonicity. Sing-song or rhythmic style can be judged based on pitch and intensity contour periodicity, which can be detected by periodicity strength detection via the autocorrelation coefficient or equivalent methods. If desired, expressiveness can also be scored for a given affect associated with the text.

Fluency ultimately is linked to the listener’s perception of the comprehensibility (or ease of comprehending) of the speaker. Correct word decoding, grouping of words into meaningful phrases in line with the linguistic structure and reading with expression make the delivery close to natural speech and facilitate listener understanding of the text. Equivalently, lexical miscues and prosodic errors hamper the efficacy of information transfer. This can range from necessitating increased effort on the part of the listener to the complete breakdown of communication. Confidence, on the other hand, is encoded in both voice quality and prosody. For example, soft and monotone speech conveys a lack of confidence. Soft speaking is detected via spectral cues such as spectral envelope tilt. It is an important goal of reading or speaking instruction to cultivate easily comprehensible and highly confident speakers.
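The periodicity-strength detection via autocorrelation mentioned above, used to judge a sing-song or rhythmic style, could be sketched as follows; the minimum lag and the interpretation of the score are illustrative assumptions.

import numpy as np

def periodicity_strength(contour, min_lag=10):
    """Largest normalized autocorrelation peak of a pitch or intensity contour beyond a
    minimum lag; values near 1 suggest a rhythmic (sing-song) delivery, near 0 a non-periodic one."""
    x = np.asarray(contour, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    ac = ac / (ac[0] + 1e-12)
    return float(ac[min_lag:].max()) if len(ac) > min_lag else 0.0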

Subjective ratings for each of perceived confidence and comprehensibility are obtained from language teachers, or similarly qualified experts, for each speaker-paragraph (about 3-4 sentences) presented to the rater in anonymized and randomized fashion. Machine learning methods, familiar to those skilled in the art, are used to learn the mapping from the extracted user attributes to the subjective ratings for comprehensibility and for confidence over the given datasets. The machine learning method can be drawn from a number of available models including neural networks. The classifier thus becomes adaptive based on the information it is fed in the training phase. In the present context, this information pertains to the different dimensions of fluency in the target population as determined by spoken language experts. The trained classifier so derived predicts the comprehensibility on a graded scale given an input audio recording with reference to the model text and information structure.
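As one concrete possibility for the machine learning step described above, a small feed-forward network could be trained to map the extracted attributes to the expert ratings; scikit-learn and the placeholder data below are assumptions for illustration, not the classifier or dataset of the disclosure.

import numpy as np
from sklearn.neural_network import MLPClassifier

# X: one row of lexical + prosodic user attributes per recording (placeholder values);
# y: the corresponding expert comprehensibility ratings on a graded scale (placeholder values).
X = np.random.rand(200, 12)
y = np.random.randint(0, 3, size=200)

comprehensibility_classifier = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
comprehensibility_classifier.fit(X, y)
print(comprehensibility_classifier.predict(X[:5]))  # graded comprehensibility predictions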

Language learners have the tendency to subconsciously apply the rhythm of their native language to other languages unless they are made explicitly aware of the differences. Hence, considering the value of pointed feedback in the assessment scenario, we have the provision to flag improper cadence (arising potentially from mother tongue influence). Similarly, the underlying causes of poor comprehensibility are available via the lexical and prosodic attribute ratings. Given the expansive set of features computed on the user audio recording, it is possible to further incorporate the detection of other inappropriate speaking styles (possibly other cultural carry-overs), such as, for example, uptalk.

Further according to an embodiment of the invention is disclosed a model text listening and recording screen. The model text to be read is presented to the user. The content of the text displayed could depend on the level of the user. Alternatively, the user could be allowed to choose a paragraph from a presented list of paragraphs of different difficulty levels. The text may be printed on paper or displayed on a computer/android tablet screen. The text on screen may be displayed all at once or in a karaoke style with sequential word highlighting. Based on the target audience of the application, the user interface could be customized to make the reading process easier. For example, to make it engaging for younger children to read, the screen may include illustrations and highlight words as the child is speaking. Additionally, to encourage weaker children, difficult words could be highlighted, and the child could be allowed to listen to a model speaker or narrator before attempting to speak, as well as shadow along with the model speaker, all of which are known to be effective methods of reading instruction.

Figure 7 shows the model text for a short paragraph as displayed on a tablet with options to listen to the narrator or record the speaker for assessment.

Figure 8 shows a sample assessment report card for a user in accordance with an embodiment of the present invention. It is further broken down into a fluency feedback component and lexical and prosody feedback components. The fluency feedback component is linked to high level aspects of fluency in a language like comprehensibility, confidence, and cadence. The lexical feedback component includes feedback on parameters linked to the spoken transcript like Words Correct Per Minute.

The prosodic feedback component is linked to feedback on parameters like pace, phrasing, prominence, and intonation, which are linked to the comprehensibility of the utterance. Pace, which is obtained using the speech rate, is grouped into broad categories ranging from very slow to normal to very fast. Speed consistency across the audio can also be indicated. The feedback on most of the other parameters is provided at five different levels (very poor, poor, average, good, very good) and is indicated by corresponding emojis. The progress of the user, using the results of multiple attempts over time, could also be indicated using bar charts.

As per the present invention, for self-training, the user is provided the functionality to listen to their recorded audio and compare it with the assessment. The model speaker audio is also made available to the user as a reference. There is also a provision to slow down the playback of this audio for beginners. This user audio could also be “corrected” to match the model template so that the user can listen to their own corrected voice for learning.

Additionally, detailed feedback can be availed by the user on each parameter. For example, the detailed lexical feedback as shown in figure 9 could display the total number of miscues to the user as well as indicate the words from the model text that were missed or substituted with other words, or additional words that were inserted by the user which were not present in the model text. The detected phrase breaks and prominent words can be displayed with coloured highlighting to indicate whether these match with the expected event locations or not. Detailed feedback on other parameters like speech rate could include a graph plotting the variation of speech rate over time.

The overall assessment as well as detailed feedback makes for a system that can serve a valuable role in programs requiring regular monitoring and evaluation of language and literacy interventions. It can also serve in a training context where the user gets immediate feedback followed by guidance on the next steps of required practice. Such a personalized tutor can serve as a powerful supplement to classroom instruction.

Finally, all the work described herein is applicable to reading instruction and training as well as to speech and drama instruction and training. It can also easily be extended to the text-independent assessment scenario as long as an ASR engine is applied to the incoming audio recording. The model information structure is obtained by applying part-of-speech (PoS) tagging to the detected words and forming reference phrases and prominent words for comparison with the detected user attributes. Further, the statistical summary of detected pauses and pitch variations can be compared with expected norms for a fluent speaker of the language.
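In the text-independent case described above, the reference prominent words can be approximated by taking the content words of the ASR transcript as candidates for stress; the small function-word list below is an illustrative stand-in for a full PoS tagger, not the tagger of the disclosed system.

FUNCTION_WORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "to", "in", "on",
                  "at", "and", "or", "for", "with", "it", "he", "she", "they", "we", "you"}

def expected_prominent_words(asr_words):
    """Content words from the ASR transcript, used as reference prominences in the
    text-independent case (a PoS tagger would normally supply this information)."""
    return [w for w in asr_words if w.lower() not in FUNCTION_WORDS]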

Some of the significant features and the non-limiting advantages of the invention over the prior art are mentioned below:

• The proposed solution forms an effective component in any testing of spoken communication skills. For instance, it may be used as an oral reading assessment for children in primary school, as a literacy assessment tool for the large-scale monitoring of basic education programmes, or as a practice tool in speech and drama training.

• For scenarios involving the text-independent case, the present disclosure provides additional statistics of pauses and durations as well as of pitch variations and voice quality to be used for fluency scoring. The embodiments of this disclosure are not purely based on the pronunciation of words, as was the case in the prior art.

• The assessment of individuals could be used by the organization to obtain some trends for intervention such as specific cultural or native tongue influences.

• User video recording, and the effects produced therefrom with the present invention, makes the process engaging.

• The system disclosed herein incorporates identifying the difficulty of the story using the text attributes and facilitates the personalization of the training to match individual skill levels.