


Title:
AUTOMATED SYSTEM FOR TRAINING ORAL LANGUAGE PROFICIENCY
Document Type and Number:
WIPO Patent Application WO/2013/172707
Kind Code:
A2
Abstract:
The present invention is in the field of automated systems and methods for improving oral language proficiency. As a result of increasing internationalization there is a growing demand from the education and business communities for people who speak foreign languages well. Intelligible pronunciation is regarded as important for e.g. successful interaction and social acceptance. However, an important problem is that oral proficiency training requires so much time, feedback and practice that very often it cannot be sufficiently provided in traditional language classes. Thereto an automated system is provided.

Inventors:
STRIK WILHELMUS ALBERTUS JOHANNES (NL)
CUCCHIARINI CATIA (NL)
Application Number:
PCT/NL2013/050356
Publication Date:
November 21, 2013
Filing Date:
May 14, 2013
Assignee:
STICHTING KATHOLIEKE UNIV (NL)
International Classes:
G09B5/04
Foreign References:
US5679001 A, 1997-10-21
US20060058996 A1, 2006-03-16
Attorney, Agent or Firm:
VOGELS, Leonard Johan Paul (XS Amsterdam, NL)
Claims:
CLAIMS

1. Automated system for assisting real time training of oral language proficiency of a user in a non-native language, comprising:

a) at least one means for receiving target audio input, such as a microphone,

c) a processor for capturing and processing input and providing output, such as a computer,

d) at least one means for providing output to the user, such as a speaker for providing audio feedback and a monitor for providing visual feedback,

characterised in that

b) stored on the system

i) first phase speech recognition software for determining audio input in a tolerant mode, the first phase speech recognition software providing input to second phase speech recognition software, and

ii) second phase speech recognition software for determining audio input in a strict mode, comprising a pronunciation quality evaluation unit for processing input to determine potential difference between stored target pronunciation and actual audio input pronunciation, and for generating feedback output.

2. System according to claim 1, further comprising stored on the system one or more of b)

iii) a pronunciation error detector,

iv) a word stress error detector,

v) a morphology error detector,

vi) a syntax error detector,

vii) an interaction error detector,

viii) an intonation error detector

ix) a respiration error detector,

x) a formant error detector, and

xi) a selector for selecting a first phase speech recognition software version and/or a second phase speech recognition software version, the version(s) being optimized for a group of users, and/or wherein input and/or output are in a second language and the user being native in a first language,

wherein the first and second language are selected from Indo-European languages, such as Spanish, English, Hindi, Portuguese, Bengali, Russian, German, Marathi, French, Italian, Punjabi, Urdu, and Dutch,

Sino-Tibetan languages, such as Chinese,

Austro-Asiatic languages,

Austronesian languages,

Altaic languages,

such as wherein the first and second language are Dutch and English, Dutch and German, Dutch and Spanish, Dutch and Chinese, German and English, French and English, Chinese and English,

preferably wherein the second language is English, and vice versa,

wherein the first and second language are optionally the same, such as Dutch and Dutch.

3. System according to claim 1 or claim 2, wherein the pronunciation quality evaluation unit is adapted for one or more varieties and/or dialects, such as British English, American English, Australian English, Canadian English, New Zealand English, Indian English, Limburgs, Brabants, Gronings, and Drents.

4. System according to any of claims 1-3, wherein the pronunciation quality evaluation unit comprises software, wherein the software is preferably stored on a computer.

5. System according to any of claims 1-4, further comprising one or more of a language model, a lexicon, a pho- neme model, one or more thresholds, one or more probability criteria, one or more random number generators, a level adjustment set-up, and a decoder.

6. System according to any of claims 1-5, further comprising one or more of a reference set of parameters, a fine-tuning mechanism, a self-learning algorithm, a self-improvement algorithm, a selection means for selecting criteria, and a data base, wherein data is stored for one or more of pronunciation, word stress, intonation, and phoneme segmentation.

7. System according to any of claims 1-6, further comprising one or more decision trees, such as a decision tree being adapted to provide questions and responses thereto, and a decision tree being adapted to provide purposive training in view of second phase speech recognition.

8. Method for assisting automatic real time improvement of oral language proficiency using a system according to any of claims 1-7, comprising the steps of:

a) providing target audio input to a microphone,

b) processing input with speech recognition software,

c) wherein a computer is used for processing input and output,

d) providing feedback, such as audio feedback by a speaker and visual feedback by a monitor, and

e) providing automatic real time feedback aimed at pronunciation improvement by a pronunciation quality evaluation unit.

9. Method according to claim 8, further providing a standardized score of oral language proficiency.

10. Method according to any of claims 8-9, further monitoring scores of users and relation between one or more users in a sequence of users.

11. System according to any of claims 1-7 and/or a method according to any of claims 8-10 for improving a non-mother language.

12. System according to any of claims 1-7 and 11 and/or a method according to any of claims 8-10 for use in medicine, such as in clinical or pre-clinical care.

13. System or method according to claim 12 for treating dysarthria, e.g. caused by CVA, a brain tumor, an accident, ALS (Amyotrophic Lateral Sclerosis), a neurological disease, such as Parkinson's Disease, a disorder associated with the motoric nerve system, such as in logopedia, for improving eating performance, improving control of organs, such as tongue, for improving intelligibility, audibility, naturalness, and/or efficiency of vocal communication.

Description:
Automated system for training oral language proficiency

DESCRIPTION

FIELD OF THE INVENTION

The present invention is in the field of automated systems and methods for training of oral language proficiency.

BACKGROUND OF THE INVENTION

As a result of increasing globalization there is a growing demand from the education and business communities for people who speak foreign languages well. Intelligible pronunciation/speech in a second language (L2) is regarded as important for e.g. successful interaction and social acceptance. However, an important problem is that oral proficiency training requires so much time, feedback and practice that very often it cannot be sufficiently provided in traditional language classes. For instance, Dutch students have problems with different aspects of English, especially the sound system, such as different words sounding similar. Besides pronunciation they often also have problems with grammar, vocabulary and sentence structure.

In general, it is considered that automatic phoneme recognition has proven to be a difficult technical problem in the field of speech recognition. Even the best automated systems detect only 20%-30% of the phoneme errors made.

Typically a state of the art high-end computer program does not specifically address oral proficiency skills in a second language and cannot be used to support and improve language learning anytime and anywhere. Advanced, dedicated technology is not available to make this possible. This leads to a lack of appropriate feedback and remedial exercises for a learner.

There are some rudimentary programs that do use computer assisted learning applications with automatic speech recognition (ASR), but these only provide right/wrong feedback, and they generally do not provide control or checks. This kind of technology is not yet advanced. Feedback on pronunciation of a user may be provided through waveforms.

Various documents recite systems for improvement/training of oral proficiency skills and the like. For instance, US5679001 (A) recites a children's speech training aid which compares a child's speech with models of speech, stored as sub-word acoustic models, and a general speech model to give an indication of whether or not the child has spoken correctly. The aid requires an adult operator to enter the word to be tested into the training aid, which then forms a model of that word from the stored sub-word speech models. The aid gives at best an indication of correct pronunciation, based on a global approach, the indication being "correct" or "wrong". Such an indication has no detailing of what (specific element(s)) is actually wrong, only that it is wrong (not correct). The indication may then be used by an operator (a human being) to identify children in need of speech therapy (human assisted, not real time). In other words, only after a large number of very specific instances is provided can a specific instance possibly be identified as relating to an issue needing therapy. The system does not provide real time feedback.

US2006058996 (A1) recites a system and method relating to voice recognition software, and more particularly to voice recognition tutoring software to assist in reading development. Such relates to nothing more and nothing less than a read-out help and fluency of reading. Such a help tests knowledge of a language, rather than language proficiency. Based on a number of errors identified, the system may allow more errors to be made. No feedback is provided.

Some systems relate to overall and/or aggregate characteristics of a user's speech; as a consequence delayed feedback may be provided at best. For training purposes such is e.g. impractical. Also, typically only one item at a time can be dealt with, such as mispronunciation of one consonant, such as "r", e.g. being typical for certain foreign speakers. Such is also not coherent with state of the art training methods, nor with the perception which a user has of training.

Some documents explicitly mention that it is impossible to infallibly detect that an individual pronunciation error has been made by a speaker. As it is not possible to detect individual pronunciation errors with high precision, typical prior art systems draw aggregated conclusions about an average pronunciation error type a user makes based on an analysis of one or more recordings of the individual's speech. For instance, although for prior art systems it is difficult to accurately classify each phoneme produced by a speaker, over the course of the reading of a known passage such systems can determine a probability or certainty that a particular class of error is present in the speech, i.e. not an individual error.

Some recite a sort of speech recognition system. Such does not relate to improvement/training of oral language proficiency. Further, such systems typically do not take a level of proficiency and/or accents or the like into account.

Some recite "schemes" for processing oral input, some recite hearing systems, some recite generic learning, and some recite formalistic approaches. Typically these are very gen- eral, and do not provide adequate details. Typical prior art documents do not provide reliable and reproducible results, for instance as underlying systems are not or poorly developed .

Typically prior art systems comprise only one or a few of the necessary technology (modules) in order to perform adequately, e.g. in terms of teaching ability. Typically prior art systems relate to one aspect of language learning only. Typically there is no or limited correction of oral proficiency skills errors. If a comparison can be made, typically a correction would relate to identifying whether pronunciation is "wrong" or "correct", i.e. there is no underlying system for identifying further details, which details may be improved or which may be sufficient. Some prior art systems focus on only one item, such as improving pronunciation of vowels. Therefore a prior art system is typically also not capable of handling an accent of a user. For instance, for a given language pair, such as the Dutch-English language pair, there is no technology that can automatically handle non-native English with different (degrees of) Dutch accents.

There is therefore a need for the technology to be optimized for a specific language pair, i.e. Dutch-English.

Not only should a system on the one hand be able to recognize all the (English) input, although spoken with many different (Dutch) accents, but on the other hand a system should also be able to detect pronunciation errors. Preferably all modules to detect errors need to be developed in order to provide adequate performance, and further these should be combined in an optimal way (design) to obtain a system that is suitable for practicing (English) oral proficiency skills. This requires not only the technology (the separate modules), which in itself is very challenging, but also a mix of expertise including knowledge about language acquisition, language teaching, software design, etc. The prior art systems do not meet all these requirements.

Also, typical prior art systems cannot be operated in real time, thereby making interactive learning virtually impossible.

Further, users consider that a program must take the level of the user into account and automatically offer new exercises.

The present invention therefore relates to a system and a method for automatic improvement of oral proficiency skills, which overcomes one or more of the above disadvantages, without jeopardizing functionality and advantages.

SUMMARY OF THE INVENTION

The present invention relates in a first aspect to an automated system according to claim 1 comprising various electronic elements. The automated system is typically implemented on a computer or the like. The present system is suited for real-time improvement of oral language proficiency.

In the present application the phrase "oral language proficiency" relates amongst others to communication per se, such as posing a (simple) question, obtaining an answer and interpreting the answer, morphology of words, syntax of a (simple) sentence, pronunciation, e.g. using correct phonemes, and skills attributed thereto.

In general it is noted that Automatic Speech Recognition (ASR) is already quite challenging for native speech, but it is even more challenging for non-native speech, since non-native speech deviates substantially from native speech in at least three aspects: the sounds, lexicon, and grammar differ. As such the above mentioned prior art automated systems will detect even fewer (less than 20%) of the speech errors made and are as such not applicable for non-native speakers. These three aspects are directly related to three main components or knowledge sources of the present ASR system: acoustic models, lexicon, and language model. These components may use a chance algorithm in order to identify one or more probable occurrences. The components are adapted and optimized in view of input, e.g. dialect, probability, context, etc. Further a sort of decoder is provided for interaction between the components, communication with an outside world, etc. In developing this technology, the present inventors took into account what is possible and what is not yet possible with state-of-the-art ASR. Since automatic recognition of all unconstrained, spontaneous non-native speech is not yet possible, exercises in the present system have been constrained in such a way that they elicit speech that can be handled automatically with ASR, but are still suitable for language learning. In an academic context, there are a limited number of research groups world-wide that carry out research in this field. The present inventors are leading experts in this line of research and have been involved in research and related activities for many years. First of all, none of the other groups develop speech technology for language learning specifically for Dutch, neither for Dutch as a first language (L1) nor as a second language (L2). Furthermore, the technology developed by these academic sites is generally intended for research alone, not for commercial or practical purposes. Finally, the present technology differs in many ways from others. As a consequence, e.g. a detection rate (identified speech inconsistencies) of the present system is 80%-90%, as has been established upon evaluating the system with a significant number of users. If the present system in use is optimized further, such as by detecting a mother language or dialect, the detection rate increases to above 90%, in other words to a level at which the present system can be used in practice for training oral language proficiency.

The present inventors have carried out a considerable body of research into applying speech technology, e.g. speech recognition technology, to language learning and testing, specifically to learning Dutch as a second language (DL2), i.e. for foreigners that are in the Netherlands and want to learn Dutch. In this case L2 is Dutch, and L1 can be many different languages. It is noted that since people with different L1s differ in the way they speak an L2 (here Dutch), research carried out on Dutch was quite challenging, especially compared to a fixed language pair such as L1 = Dutch and L2 = English. The present technology and products extend to other (combinations of) languages, e.g. French, German, or Spanish for Dutch students or Dutch (as L2) for students from other countries. It has been found that to a large extent the present technology can be ported from one language to the other. Thereto studies were performed.

Desktop research was carried out to see whether information on porting speech technology between languages was available. Porting speech technology is taken to mean taking speech technology, e.g. speech recognition technology, that was developed for recognizing speech in a first language (L1), and then applying it for recognizing speech in a second language. No relevant information was found on porting speech technology for use by non-native speakers. In addition, two pilot experiments were carried out on the feasibility of porting specific speech technology modules from Dutch to English (both as L2, as target language).

An experiment concerned porting the present technology developed for detecting errors in the pronunciation of sounds from Dutch to English. First of all, speech recordings of Dutch students speaking English were collected and annotated in various ways: what was said (the words), how it was said (phonetic transcriptions providing information on how these words were pronounced), and also which sounds were pronounced correctly and which ones incorrectly. In addition the inventors implemented a so-called computer phonetic alphabet for English, an English lexicon, and acoustic models for non-native English. All these resources and information were used to develop and optimize speech technology for detecting errors in sounds in non-native English speech. An important aspect of this work was the identification of classifiers, i.e. parameters that define particular errors made by non-native speakers when speaking a foreign language, in this example Dutch people speaking English. Such required a lot of manual work. In addition, experts on English from Radboud in'to Languages and the English Language Department thereof were often consulted.

A second experiment concerned porting the technology developed for detecting errors in prosody, e.g. intonation and word stress, from Dutch to English. Detecting word stress errors proved to be more complex than detecting errors in the pronunciation of sounds. Nevertheless, the experiences were similar to those in pilot experiment 1. Also in this case recordings and annotations were needed. Speech recordings can be the same as those used for pronunciation error detection, as the speech material was carefully designed in such a way that it was suitable for both purposes. However, additional annotations were necessary that indicate syllabification, word stress, and whether the words were pronounced with correct word stress or not. It has been found experimentally that this also required a lot of manual work. Besides the data (mainly recordings and annotations) mentioned above and the expertise of several persons, this also requires software to make recordings, annotations and analyses, and to train classifiers for word stress error detection. The adjustments made in the software (in going from e.g. Dutch to English) were limited, and the expertise needed for carrying out the work described above was available with the present inventors. However, for every new language (pair) new data was collected and annotated. In addition, the material then is used to train, test, and optimize classifiers.
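By way of illustration only, the final step of training a word stress error classifier from such annotations can be sketched as follows; the features, labels and the choice of logistic regression below are hypothetical placeholders and not the classifiers actually used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical per-syllable features (e.g. duration, mean energy, mean F0,
# vowel quality) and labels (1 = incorrect word stress, 0 = correct), which
# in practice would come from the manual annotations described above.
X = np.random.rand(200, 4)           # placeholder feature matrix
y = np.random.randint(0, 2, 200)     # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
classifier = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", classifier.score(X_test, y_test))
```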

To summarize, it has been shown to be feasible to develop these classifiers, and the tools (software) and expertise acquired for one language have been shown to be very useful in developing the technology for other languages. Such has also been shown to speed up developing technology for new language pairs, i.e. the amount of time needed to develop technology for other languages gradually diminishes. Then, the main costs relate to those for collecting and annotating the speech recordings. The present invention provides very specific, detailed and accurate feedback at sound level.

The present system is provided with a means for receiving audio input. The input is typically provided by the user, the user reading out loud a (target) text, the text being provided by the present system, giving an answer to a question posed, etc., such as in the form of spoken language. The target text and the like may be provided by an avatar. The present system may provide prompts. As such a user may select to repeat an exercise, hear back his/her own input, be provided with an example input, continue, etc. The example input may also be provided as a randomly provided sequence of words, which requires a user to return a correct syntax. Likewise inflection may be trained. Spoken language is typically provided within an exercise, such as by reading out loud a word, a sentence and the like. A typical length of the present input is 10-250 phonemes, such as 50-100 phonemes. It has been established that, especially from a learning efficiency point of view, such a not too long and not too short length is preferred. Therefore the means typically relates to one or more microphones, directed to receive input. The one or more microphones may be part of a further apparatus, such as a computer, a mobile phone, etc.

The present system is further provided with a processor for capturing and processing input and providing output, such as a CPU of a computer, a mobile phone, and the like. The processor may further comprise software, for performing one or more of detecting errors, determining input, providing output, reducing noise, improving signal to noise ratio, etc.

The system is also provided with at least one means for providing output to the user, such as a speaker for providing audio feedback, and a monitor for providing visual feedback. With the output means a user can hear back his (or her) captured spoken input, hear a target input, see (a representation of) errors made, etc. The present system is capable of producing feedback in real time, typically within a few seconds or less. It is noted that especially improved or faster processors may shorten the feedback time even further, to less than one second; the present system has not been optimized yet in this respect.

Typically the present system is also provided with a means for interaction between a user and the system, such as a computer or the like, having a monitor, a means for scrolling, such as a mouse or the like, a means for providing text, such as a keyboard, etc.

The present (first and second phase) automated speech recognition software (ASR) may consist of a decoder (a search algorithm) and three 'knowledge sources': a language model, a lexicon, and at least one acoustic model. The language model (LM) contains probabilities of words and sequences of words. Acoustic models are models of how the sounds of a language are pronounced. The lexicon contains information on how the words are pronounced.
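Purely as an illustrative sketch of how these knowledge sources relate to each other, the components can be pictured as below; the class names and interfaces are hypothetical and do not describe the actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Lexicon:
    """Maps words to one or more phoneme sequences (pronunciation variants)."""
    entries: dict[str, list[list[str]]] = field(default_factory=dict)

    def pronunciations(self, word: str) -> list[list[str]]:
        return self.entries.get(word.lower(), [])

@dataclass
class LanguageModel:
    """Holds probabilities of words and word sequences (here: simple bigrams)."""
    bigram_probs: dict[tuple[str, str], float] = field(default_factory=dict)

    def probability(self, previous: str, word: str) -> float:
        return self.bigram_probs.get((previous, word), 1e-6)

@dataclass
class AcousticModel:
    """Scores how well an audio segment matches a given phoneme."""
    name: str = "non-native-english"

    def score(self, audio_segment, phoneme: str) -> float:
        raise NotImplementedError("backend-specific, e.g. an HMM or neural model")

# A decoder would search for the word sequence that maximizes the combined
# acoustic-model and language-model scores, using the lexicon to expand
# words into phoneme sequences.
```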

The present system may further comprise a first means for determining input, such as first phase (automated) speech recognition software, which software typically determines input in a tolerant mode, e.g. globally checking given (or actual) input versus required (target) input (the provided target text). A goal is to recognize words a user intended to pronounce, even though the non-native speech of a user may deviate in various ways. The ASR system is optimized for this phase, e.g. by tuning the three knowledge sources using non-native speech.

The output of the first phase speech recognition software provides input to the second phase speech recognition software .

The system may further comprise a second means for determining input, such as second phase (automated) speech recognition software comprising a pronunciation quality evaluation unit for processing input to determine potential difference between target pronunciation and actual pronunciation, which unit functions in a detailed and strict manner. The manner may depend on the level of the user. The output of the first phase may be used as input, as well as the non-processed captured input.

In the second phase the system is strict. Now a goal is to detect errors, such as large deviations between the pronunciation received and the target pronunciation. A further version of the ASR system is used which is optimized for this task. The ASR system then segments the non-native speech signal; it detects the position (begin and end point) of the words and the phonemes (sounds). This information is used for error analysis, e.g. to detect errors in the order of words (syntactic errors), and whether phonemes are pronounced correctly. The user of the system can get immediate feedback on the errors made within a spoken utterance.
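A minimal sketch of this two-phase handling of one utterance is given below; the recognizer and evaluator objects and their methods are assumptions introduced for illustration, not the actual software.

```python
def train_utterance(audio, target_text, recognizer, evaluator):
    """Two-phase handling of one spoken exercise (illustrative sketch only).

    Phase 1 (tolerant): recognize what the learner intended to say,
    accepting non-native deviations.
    Phase 2 (strict): segment the audio against the target pronunciation
    and report segment-level differences as feedback.
    """
    # Phase 1: tolerant recognition of the intended words.
    hypothesis = recognizer.recognize_tolerant(audio)

    # Phase 2: strict evaluation; uses both the phase-1 hypothesis and the raw audio.
    segments = evaluator.segment(audio, hypothesis)    # word/phoneme begin and end points
    errors = evaluator.compare(segments, target_text)  # deviations from target pronunciation

    return {
        "recognized": hypothesis,
        # e.g. [{"word": "this", "target": "th", "heard": "d"}, ...]
        "errors": errors,
    }
```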

The system may further comprise various error detectors. These detectors relate to one or more of sounds and phonemes, lexicon, grammar, and prosody. Examples are a pronunciation error detector, a prosody error detector, e.g. a word stress error detector and an intonation error detector, a respiration error detector, a formant error detector, a grammar error detector, e.g. a morphology error detector and a syntax error detector, an interaction error detector, and a lexicon error detector. Typically these detectors are optimized, e.g. in view of first and second language, such as Dutch. These detectors may also be provided in a training environment, such as the present My Pronunciation Coach® (MPC).

The system may further comprise a selector for selecting a first phase speech recognition software version and/or a second phase speech recognition software version, the version(s) being optimized for a group of users. As such a user or a teacher may set a software version that is specifically adapted to a level of oral language proficiency of a user, adapted to a native language of a user, adapted to a variety or dialect of a user, and combinations thereof.
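Such a selector can be pictured as a simple lookup over a registry of recognizer versions; the registry layout below is a hypothetical illustration, not the actual data model.

```python
def select_recognizer_version(available_versions, native_language, cefr_level, dialect=None):
    """Pick the recognizer version best matching a user profile (hypothetical registry).

    available_versions: list of dicts such as
        {"id": "nl-en-a2", "l1": "Dutch", "l2": "English",
         "levels": {"A2", "B1"}, "dialect": None}
    """
    def matches(version):
        return (version["l1"] == native_language
                and cefr_level in version["levels"]
                and (dialect is None or version.get("dialect") in (None, dialect)))

    candidates = [v for v in available_versions if matches(v)]
    # Fall back to a generic version if no profile-specific one exists.
    return candidates[0] if candidates else available_versions[0]
```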

In layman's language the present system captures analog input, transfers the input into a binary code, breaks up the binary input code, compares the input code with a target code thereby detecting differences between the two codes, and provides output to a user relating to the differences, the differences being considered "errors". The present system is capable of detecting various types and occurrences of errors within the input and providing real time feedback to a user. If most (more than 80%) of the pronunciation errors made by a user are detected by a system, the system is considered to be good.

The present software and detectors are stored. They may be stored in any means capable of storing binary data, such as a RAM, a ROM, a hard disk, a CD, a DVD, etc., and combinations thereof. The stored data should be accessible to the present system when in use. It is noted that various elements of the present system may be located within one location, even within one apparatus, such as a computer, wherein e.g. software is loaded in memory, or located at different locations, such as on the internet, on a mobile phone, on a computer, at a learning center, and combinations thereof. Within e.g. a combination a first element may function as a client to a further element, an element may function as a server, etc. For some applications it is preferred to use a server or a cloud. A user may interact with a server or cloud as a client, e.g. a browser based client. Preferably a broadband connection between client and server or cloud is used, enabling fast communication of data.

A formant is considered to be a concentration of acoustic energy around a particular frequency in the speech wave. The formants represent vowel sounds well. It has been established that formant frequencies change over the length of a syllable. It is noted that formants depend on the person speaking, e.g. a man typically has a different set of formants than a woman. Once the present system identifies the formants, it can correct for specific errors therein, e.g. by providing feedback to that end.
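As a textbook-style sketch of how formants can be estimated from one voiced speech frame (assuming the librosa package for the LPC step; this is not the system's own evaluation unit):

```python
import numpy as np
import librosa  # assumption: librosa is available; any LPC routine would do

def estimate_formants(frame, sr, order=12):
    """Rough formant estimate for one voiced frame (1-D float array) via LPC roots.

    Pre-emphasis and a Hamming window, LPC coefficients, then the angles of the
    polynomial roots as candidate formant frequencies.
    """
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    windowed = emphasized * np.hamming(len(emphasized))
    a = librosa.lpc(windowed, order=order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    freqs = sorted(np.angle(roots) * sr / (2 * np.pi))
    # Keep plausible formant frequencies only (e.g. above roughly 90 Hz).
    return [f for f in freqs if f > 90][:3]   # approximate F1, F2, F3
```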

The present invention relates to a system comprising a sophisticated multi-component stored computer program, including the various technologies needed, that students can use to practice a second language, such as English, and specifically pronunciation thereof. It has been developed over a long period of time, based on scientific insights and technology. The present invention also relates to a product comprising said system. Such a system may be referred to as a computer assisted language learning (CALL) or computer assisted oral proficiency training (CAPT) system. The present system may include the following functionalities, for which the required technology or modules (M1-M4) are then incorporated:

M1. non-native automatic speech recognition (ASR): e.g. recognize first to second language, e.g. Dutch-English;

M2. pronunciation error detection (PED): detect errors in the sounds (phonemes) produced;

M3. word stress error detection (WED); and

M4. intonation error detection (IED).

It is noted that within the present system all of the above, and optional further functionality, are carried out automatically, e.g. for non-native English with different (degrees of) Dutch accents, or for non-native Dutch speakers. There is no need for a teacher or tutor. The present invention fulfills a need for Computer Assisted Language Learning (CALL) applications that make use of Automatic Speech Recognition (ASR). CALL provides a private, stress-free environment in which students can access virtually unlimited input, practice at their own pace and, through the integration of ASR, receive individualized, instantaneous feedback anytime and anywhere. In an example the system is intended for Dutch students that want to learn English (fixed language pair).

The present invention makes use of unique, advanced ASR technology for accurate pronunciation error detection, developed by experts operating at the forefront of international research. This allows the system to offer new functionalities such as detailed and accurate phone-specific corrective feedback and related remedial exercises, which are not yet offered by other products, and certainly not with the degree of precision that is required for effective oral proficiency training and that the present technology can achieve. The present invention provides enabling technology modules that can be integrated into existing educational applications and courses. The present speech recognition and error analysis technology may be accessible through an application programming interface which connects via web services. The present invention provides an application that customers can use to develop courses. Customers can easily create courses with the authoring tools supplied in the framework. The framework application is built upon the technology module and is available as software and as a service.

The present invention relates amongst others to a complete course based upon content from Radboud in'to Languages. This ready-made course can be used by organizations to improve learners' pronunciation skills. The course is modular and at present suited for levels from A2 to B2 (according to the Common European Framework of Reference for Languages, CEFR). Further, the course, being interactive, can be adapted within its present framework to a need of a client, e.g. in terms of level.

The present invention provides products and services that generate leads, strengthen client relations (customer satisfaction) and improve the center of expertise. It also relates to advice on policies and didactics: information and advice on the necessity, added value and didactic applicability of ASR-based CALL. Further to implementation guidance/project management: well-planned and structured guidance to ensure organization-wide use of the products in line with the strategic and didactic objectives of a client. Also to training: to stimulate acceptance and use, and to transfer our knowledge and experience. As noted, the present invention may be integrated into a client's ICT infrastructure.

The present invention provides a unique product- market combination. The market can be divided into various segments, e.g. the segments 1-4 being further detailed below.

1. Conventional education

Language teaching in conventional education institutions is typically based on core objectives, end terms or qualification profiles, which are (legally) embedded in a curriculum. Schoolbooks and digital courses from publishers are commonly used. For this market segment, a ready-made course is an interesting product; preferably courses based upon the methods from the publishers the school uses. Additionally, the present framework will allow institutions to develop their own pronunciation courses.

2. Commercial language centers

These centers will be able to use present technology by integrating it in their own educational applications and courses. Additionally, present framework will allow them to develop their own courses.

3. Publishers

The present framework allows them to develop pronunciation courses that link up to their methods. Additionally, they have the option of offering content that users of the inventors' My Pronunciation Coach (MPC) framework can assemble into pronunciation courses. An example of this would be a publisher supplying lists of words that a teacher can assemble into a course within the framework application.

4. Integration partners

Modular technology will be most interesting for this market segment, e.g. because of the possibilities for integration within suppliers' applications and courses. Additionally, the functionality of their existing applications can be extended by linking them to the MPC framework. For instance, a supplier of test software integrates the MPC technology to introduce new question types.

The present invention also relates to a so-called Software Development Kit: specifications and documentation of APIs and web services enabling third parties to develop their own tools and extensions, which can be plugged into the present framework. Such further relates to certifying, promoting and distributing add-ons within a user community. In addition, educational language games are developed.

Unique product features are e.g. personalized, accurate feedback on individual words and sounds, adaptive learning combined with remedial exercises, and individualized progress reports. The present system provides a "coach" for improving English proficiency: an automatic coach that listens. For instance, when using a monitor, feedback can be provided by highlighting parts (vowel, (composed) consonant, etc.) of a target text that contain an error. The present system can further be adapted to limit the number of errors fed back to a user. Such, for instance, does not demotivate a user. As such a filter is provided, the filter (stored on the present system) allowing only errors above a given threshold (set by user, coach, or system) to be presented to a user. It is noted that in principle all errors made by a user can be fed back to the user in a real time modus, that is within seconds. Such is considered important for training, as a spoken sequence of words (one or more) or syllables is still in a user's memory when feedback is provided (an echo is as it were still in the ear); the feedback and spoken sequence can then be compared directly by a user. Such is impossible or at least impractical when feedback is provided much later, or when "overall" or aggregate feedback is provided (e.g. an "r" not being pronounced correctly in a sequence of sentences or even paragraphs).
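Such a threshold filter can be sketched as follows; the error representation and the default values are illustrative assumptions only.

```python
def filter_feedback(detected_errors, severity_threshold=0.5, max_items=5):
    """Limit real-time feedback to the most severe errors (illustrative only).

    detected_errors: list of dicts with at least a "severity" score in [0, 1];
    the threshold and cap would be set by the user, coach, or system.
    """
    above_threshold = [e for e in detected_errors if e["severity"] >= severity_threshold]
    above_threshold.sort(key=lambda e: e["severity"], reverse=True)
    return above_threshold[:max_items]
```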

Within the present system the ASR software package SPRAAK has been used. It is freely available for non-profit research, and can also be used for commercial applications. The present system allows, however, for a switch to another speech recognizer system. Such another speech recognizer system can be implemented in a straightforward manner into the present system. Typically pronunciation errors are detected at the level of individual words and sounds with a high degree of accuracy, thereby e.g. providing appropriate feedback and exercises.

The present system can be optimized for any language pair, following a procedure developed thereto. For instance for a specific language pair, Dutch-English, such a procedure is implemented, using expertise and data obtained for this language pair. Using expertise, data, and technology for other languages (esp. foreign-Dutch) the procedure is repeated.

Typically the present method and system use a two-step approach, in which e.g. Dutch-English is recognized in a tolerant manner, typically relating to verification of the (intended) expression, that is without overstressing (minor) mistakes and reflecting thereon, and in a further step pronunciation errors are detected, which is done in a strict way, wherein strictness may vary depending on the level of expertise of a user. The level of expertise may be characterized according to the Common European Framework of Reference for Languages (CEFR: from A1, low level, to C2, advanced level).
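A possible, purely illustrative way to couple second-phase strictness to the CEFR level is a level-dependent tolerance, as sketched below; the numerical values are invented for the example and are not the system's actual settings.

```python
# Hypothetical mapping from CEFR level to second-phase strictness: the lower
# the learner's level, the larger the tolerated deviation from the target.
CEFR_TOLERANCE = {
    "A1": 0.40, "A2": 0.30,
    "B1": 0.20, "B2": 0.15,
    "C1": 0.10, "C2": 0.05,
}

def is_pronunciation_error(deviation_score, cefr_level):
    """Flag a deviation as an error only if it exceeds the level-dependent tolerance."""
    return deviation_score > CEFR_TOLERANCE.get(cefr_level, 0.20)
```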

Experimentally modules were developed and tested in isolation, and later combined into a complete, suitable CALL system using a mix of expertise (as mentioned above) .

Above, various problems were mentioned. Regarding three of these problems the following is noted:

There is no CALL system for e.g. the Dutch-English language pair. In an example the present system is optimized for this specific language pair (Dutch-English).

It is not known to use a two-step procedure as indicated above in a CALL system, such as for other language pairs.

Some of the modules M1-M4 (see above) seem to have been developed, albeit typically for different language pairs, but not for Dutch-English. However, these have not been implemented into one combined system. Further, they do not provide the present functionality. Also, the prior art systems do not make use of a mix of expertises, such as mentioned above, and further expertise provided by tutors and language professionals, e.g. relating to learning algorithms.

The present invention has been applied to amongst others native Dutch (especially word stress and intonation), and non-native Dutch users, i.e. foreigners (with many different nationalities) learning Dutch. The latter are referred to as foreign-Dutch. For foreign-Dutch, a complete system has been developed and tested, and it has been established that it is suitable and effective for language learners in general. It has been found that the technology and expertise acquired for the foreign-Dutch case is in principle transferable to any other language pair, such as localized Dutch-English.

The present system targets amongst others correction of prosody (appropriate emphasis and inflection), and of deficits in rate of articulation, intensity, formants and phonation (control of vocal folds for appropriate voice quality and valving of the airway). These treatments may involve exercises to increase strength of and control over articulator muscles, and using alternate speaking techniques to increase speaker intelligibility.

The present system may be accessible on internet, on a hard disk of a computer, on a DVD, a CD-ROM, etc.

The system may be used in Computer Assisted Language Learning (CALL), in Computer Assisted Learning (CAL), in Computer Assisted (aided) Instruction (CAI), in Computer Assisted Pronunciation Training (CAPT), in improving any language proficiency, etc. A lack of proficiency may be caused by a human body deficiency, such as one caused by an accident, one being present from birth, etc.

Thereby the present invention provides a solution to one or more of the above mentioned problems, by providing an extended system, comprising various functionalities, wherein the functionalities are further optimized with respect to each other, thereby further improving functionality and user friendliness.

Advantages of the present description are detailed throughout the description.

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates in a first aspect to an automated system for improvement of oral language proficiency according to claim 1.

In an example the present invention relates to a system wherein input and/or output are in a second language and the user is native in a first language, wherein the first and second language are selected from Indo-European languages, such as Spanish, English, Hindi, Portuguese, Bengali, Russian, German, Marathi, French, Italian, Punjabi, Urdu, and Dutch, Sino-Tibetan languages, such as Chinese, Austro-Asiatic languages, Austronesian languages, and Altaic languages, such as wherein the first and second language are Dutch and English, Dutch and German, Dutch and Spanish, Dutch and Chinese, German and English, French and English, Chinese and English, preferably wherein the second language is a foreign language such as English, and vice versa, wherein the first and second language are optionally the same, such as Dutch and Dutch.

It is noted that the present system allows for the first and a second language to be the same, e.g. Dutch and Dutch, or to be different. Such is typically the case for e.g. persons having a disorder or the like, persons having Dutch as a non-mother language, persons having communicative limitations in general, and persons less skilled in the Dutch language. In such and similar cases especially an adapted first phase speech recognition software may be used, whereas the second phase speech recognition software can remain (largely) the same. In a sort of overview mode a selection option may be introduced. In the overview mode, based on e.g. probabilities of errors detected, (lack of) quality of input provided by and capabilities of a user, etc., an opportunity is introduced to set and adapt criteria used in the second phase, e.g. somewhat more focused on specific instances (such as a phoneme) or somewhat less focused.

In an example the first language may be selected from Dutch, German, French, Spanish, Italian, Polish, Chinese, Japanese, Korean, Afrikaans, and English.

In an example the second language may be selected from Dutch, German, French, Spanish, Italian, Polish, Chinese, Japanese, Korean, Afrikaans, and English.

Also varieties of the above languages may be selected, such as British English, American English, Australian English, Canadian English, New Zealand English, Indian English, etc.

Further, the present system is also adapted to process dialects, such as Dutch dialects and varieties, such as wherein the pronunciation quality evaluation unit is adapted for one or more varieties and/or dialects, such as British English, Limburgs, Brabants, Gronings, and Drents. Clearly such can only be achieved after gathering data, analyzing data, ordering data, etc. as described throughout the description.

Likewise the system may also be used to learn a language not being a mother language, e.g. a foreign language, when staying abroad, such as when immigrating, for study, for work, etc. For example, the present system may be used to learn Dutch as a second language, e.g. for people from Turkey, Morocco, Suriname, the Dutch Antilles, Germany, Great Britain, Poland, etc. As such the present invention is widely applicable.

In an example the present invention relates to a system wherein the pronunciation quality evaluation unit comprises software, wherein the software is preferably stored on a computer.

In an example the present invention relates to a system further comprising one or more of a language model, a lexicon, a phoneme model, one or more thresholds, one or more probability criteria, one or more random number generators, a level adjustment set-up, and a decoder, wherein the decoder may comprise the previous elements.

In an example the present invention relates to a system further comprising one or more of a reference set of parameters, a fine-tuning mechanism, a self-learning algorithm, a self-improvement algorithm, and a selection means for selecting criteria. The parameters may for instance relate to one or more classifiers, as well as to (implementing) algorithms, e.g. for determining a probability.

In an example the present invention relates to a system further comprising a data base, wherein data is stored for one or more of pronunciation, word stress, intonation, and phoneme segmentation. It is noted that the present data base comprises an extensive amount of data, gathered throughout the years.

In an example a reference set of parameters or classifiers relates for instance to selecting relevant parameters first, then identifying a sub-set thereof to be taken into account, and further identifying a cutoff value, for instance below which no action is taken. Values and parameters may vary in view of the level of a user. For instance, in Europe users are categorized from A1, being the lowest level, towards C2, being the most advanced level. Based on the level of the user, feedback, use of parameters, etc. may e.g. be more or less stringent, that is in view of objectives. As such the present system may be fine-tuned to the level of a user. Also input from professionals is taken into account.

As such various levels, e.g. in view of the above parameters, may be distinguished, such as for advanced learners and for beginners. Feedback is provided at an expected level.

The present database is filled with a huge amount of information, such as spoken sentences, models, etc. Further, the data is organized, e.g. automatically by self-learning software, such as probabilistic software.

Even further, input from professionals in the field, e.g. tutors, is incorporated in the database.

In an example the present system further comprises one or more decision trees, stored on the system, such as a decision tree being adapted to provide questions and responses thereto, and a decision tree being adapted to provide purposive training in view of second phase speech recognition. An example of a decision tree is a job interview. A user is e.g. asked (general) questions relating to various aspects of the job and to the user's background. An example may relate to a route to be followed, e.g. towards a museum in a city. In general the decision tree may relate to a Quest. The decision tree may be specifically adapted towards error detection in the second phase, assuming a user is sufficiently proficient.

As such fine tuning of training can be established. It is noted that when using an algorithm that detects global errors it is impossible to have a system comprising a decision tree, e.g. as no real time feedback can be given. The decision tree may identify whether a user makes errors or not (orally proficient or not). Upon identifying errors (or not) a further question may be asked, or not. As such a user "moves" through a decision tree and the progress of a user can be monitored. The interaction becomes much more vivid.
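A conversation of this kind can be represented as a small tree of prompts whose branches are chosen by the outcome of the second-phase evaluation; the sketch below is illustrative, and the evaluation callback is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueNode:
    """One node in a conversation/decision tree (e.g. a job-interview question)."""
    prompt: str
    # Maps an outcome of the second-phase evaluation ("ok" / "errors") to the next node.
    branches: dict = field(default_factory=dict)

def run_dialogue(start_node, evaluate_response):
    """Walk the tree: ask, evaluate the spoken answer, branch on the result."""
    node = start_node
    while node is not None:
        print(node.prompt)
        outcome = evaluate_response(node.prompt)   # "ok" or "errors", from the evaluator
        node = node.branches.get(outcome)
```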

The present invention relates in a second aspect to a method for assisted automatic improvement of oral language proficiency according to claim 8.

It is noted that feedback may be provided in various ways, as indicated. Such may depend on the optimal efficiency of a learning method. Feedback may be in a form wherein the input is fed back, but also in a form wherein improved pronunciation is provided, such as by repeating (part of) the input, in an optimized manner however.

As such also a first phase may be repeated various times, before a user enters a second phase. Likewise a second phase may be repeated, without entering again into the first phase.

In an example of the present method a standardized score of oral language proficiency may further be provided, such as for monitoring and evaluating progress.
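One simple, purely illustrative way to derive such a standardized score from the detector output is the share of correctly produced units; the formula below is an assumption, not the actual scoring scheme.

```python
def standardized_score(errors_detected, units_evaluated):
    """A standardized proficiency score in [0, 100] (sketch only): the share of
    evaluated units (e.g. phonemes) produced without a detected error."""
    if units_evaluated == 0:
        return None
    return round(100 * (1 - errors_detected / units_evaluated), 1)
```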

In an example the present method provides monitoring of scores of users and of the relation between one or more users in a sequence of users.

In an example the present technology is used to provide a conversation with the student (or user). The conversation is a network or tree of questions and answers. The progress and outcome of the conversation depend on the answers and the quality of the answers that are provided by the student. This gives the student more interaction with the software while learning a language. This stimulates a natural way of language learning. Besides the interaction, it provides a way to monitor and evaluate the level of the student and the progress that is made. The monitoring and evaluation unit is built into the system. The data gathered from the monitoring and evaluation can be linked to standards for (foreign) language learning or can be provided to organizations, for example to Human Resource Management and Marketing departments of companies and universities. If required, privacy of the student is maintained by e.g. anonymising data. The conversation itself can be used as a marketing tool for organizations. The scenery can be adjusted so the student learns a new language while getting acquainted with the organization that promotes itself in the conversation. An example is a job interview at a specific organization, or finding your way through a city while the organization is promoted visually and with text. Also, the outcome of the monitoring and evaluation can be linked to a reward for the student. The reward is based on the number of new students that are recruited by the student and on the output of the monitoring and evaluation. As such the present method and system may link standards for language learning, HRM and marketing tools and standards of an organization, and rewards to one another.

The present invention relates in a third aspect to a system according to the invention and/or a method according to the invention for improving a non-mother language. Such is especially relevant for immigrants, e.g. working in a country, and children thereof. It is noted that children of immigrants may speak another language at home than that of the country they are living in. As such the other language can be considered as a second language, wherein especially the children are less proficient. Therefore there is a need for a system or method for improving the non-mother language. It is preferred to use adapted acoustic models, dedicated specifically to children.

The present invention relates in a fourth aspect to a system according to the invention and/or a method according to the invention for use in medicine. It is preferred to use adapted acoustic models, dedicated specifically to a disorder to be treated.

Specifically the present invention may be used to improve the speech of a patient. It is noted that speech may be hampered for various reasons; some of these are given below.

The present system may be amended slightly in order to improve treatment results. For instance the number of repetitions may be altered, typically increased. Also a larger number of similar exercises may be provided to a patient in need thereof. Even further, the first means for determining input may be set more tolerant, e.g. in that more often input is accepted as sufficient. Even further, the second means for determining input may also be set more tolerant. A person using the present system for treatment may have different objectives, e.g. less stringent in certain aspects and more stringent in other aspects. Even further, an intermediate means for determining input may be provided. If it is expected that a patient will recover relatively slowly, such a further third means may provide an intermediate level to be reached.

For example, dysarthria is a motor speech disorder believed to result from (neurological) injury of a motor component of a motor-speech system and is amongst others characterized by poor articulation of phonemes. Any of the speech subsystems (such as respiration, phonation, formants, prosody, and articulation) can be affected, leading to impairments in intelligibility, audibility, naturalness, and efficiency of vocal communication. It is noted that also one or more of the speech subsystems may be improved by the present invention.

In the case of neurological injury due to damage in the central or peripheral nervous system, such may result in e.g. weakness, paralysis, and lack of coordination of the above motor-speech system, producing e.g. dysarthria. These effects in turn hinder for example control over the tongue, throat, lips or lungs, and swallowing problems (dysphagia) are also often present. In this respect the present invention is also aimed at improving e.g. weakness, paralysis and coordination.

It is noted that typically the term dysarthria does not include speech disorders from structural abnormalities, such as cleft palate, and must not be confused with apraxia, which refers to problems in the planning and programming aspect of the motor-speech system. It is noted that other speech disorders, such as the ones mentioned before, may also be improved by the present invention.

As such the present invention is also aimed at improving functionality of e.g. the nerve system, e.g. in terms of restoring nerve paths. For instance, functionality of cranial nerves that control muscles, e.g. relating to the trigeminal nerve's motor branch, the facial nerve, the glossopharyngeal nerve, the vagus nerve, and the hypoglossal nerve.

The present invention is also aimed at improving functionality relating to specific dysarthrias, such as spastic, flaccid, ataxic, unilateral upper motor neuron, hyperkinetic and hypokinetic, such as in Huntington's disease or Parkinsonism, and mixed dysarthrias. The above disorders may be of severe and mild nature. It is noted that dysarthria patients are often diagnosed as having 'mixed' dysarthria. Neural damage resulting in dysarthria is rarely contained to one part of the nervous system; for example, multiple strokes, traumatic brain injury, and some kinds of degenerative illnesses often damage many different sectors of the nervous system, causing mixed dysarthrias.

It is noted that dysarthria may sometimes also affect a single system. Severity ranges from occasional articulation difficulties to verbal speech that is completely unintelligible.

Individuals with dysarthria may encounter difficulties relating to e.g. pitch, vocal quality, speed, volume, breath control, strength, range, timing, steadiness and tone. Examples of specific effects include irregular breakdown of articulation, distorted vowels, a continuously breathy voice, monopitch, word flow without pauses, and hypernasality. Such may also be the case for a user of a second language, such as an immigrant.

It is noted that the causes of dysarthria and the like can be many, such as Huntington's disease, Parkinsonism, Niemann-Pick disease, ataxia, ALS, trauma, thrombosis, injury, embolic stroke, etc.

Articulation problems resulting from dysarthria are treated by speech-language pathologists, using a variety of techniques, which however do not make use of the present system.

The present system is also aimed at use for improving eating performance, such as by improving control of organs such as the tongue and lips, and of functions such as swallowing. If control of these organs is improved it becomes easier to e.g. eat. Such problems may for instance specifically occur with elderly people. The present system may further be supported by (slightly) changing characteristics of food, such as palatability, viscosity, etc., making it easier for a person to take in food. As a further effect, aspects such as intelligibility, audibility, naturalness, and/or efficiency of vocal communication, which may also be present as such, may be improved by the present system.

The present system and method may be used in more recent techniques based on the principles of motor learning (PML). Further devices may support speech, such as Augmentative and Alternative Communication (AAC) devices that make coping with dysarthria easier, which may include speech synthesis and text-based telephones.

EXAMPLES

The invention is further detailed by the accompanying examples, which are exemplary and explanatory in nature and do not limit the scope of the invention. To the person skilled in the art it may be clear that many variants, obvious or not, may be conceived that fall within the scope of protection defined by the present claims.

In an example the present invention provides an overview of the most important pronunciation errors and their progression over time. Tables 1 (vowels) and 2 (consonants) show in column 1 the British English (RP) sound, in column 2 the condition or context in which the error occurs, in column 3 the sound often pronounced by Dutch speakers, and in column 4 some example words. If no condition is specified, the error applies in all conditions.

Table 1: Vowel Errors in Dutch English Pronunciation

Table 2: Consonant Errors in Dutch English Pronunciation

In an example a solution is provided for Dutch students who have problems with different aspects of the English sound system, for instance with final devoicing, aspiration, dental fricatives and the pronunciation of some vowels.

Examples of these errors are provided in Tables 1 and 2. Examples of consonant errors in Table 2: final devoicing errors in rows 1, 2, and 3, e.g. the last sound of the word 'bad' being pronounced as /t/; aspiration errors in row 4, e.g. the first sound of the word 'cap' being pronounced as an unaspirated /k/; and dental fricative errors in row 9, e.g. the first sound of the word 'this' being pronounced as /d/. A frequent vowel error is shown in row 3 of Table 1, the first sound of the word 'unwise' being pronounced as /□/.
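Purely as an illustration, the rows of Tables 1 and 2 can be represented as entries of an error inventory following the four columns described above. The Python sketch below uses the consonant errors cited in this example; the field names are hypothetical and the entries are not the actual table contents.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PronunciationError:
    target_sound: str             # column 1: British English (RP) sound
    condition: Optional[str]      # column 2: condition/context of the error (None = all conditions)
    realised_as: str              # column 3: sound often produced by Dutch speakers
    example_words: List[str]      # column 4: example words

CONSONANT_ERRORS = [
    PronunciationError("/d/", "word-final", "/t/", ["bad"]),                              # final devoicing
    PronunciationError("/k/", "word-initial (aspirated)", "/k/ (unaspirated)", ["cap"]),  # aspiration
    PronunciationError("/ð/", None, "/d/", ["this"]),                                     # dental fricative
]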

In a way, the example relates to a (scientific and development) process wherein an inventory of frequent errors made by Dutch people learning English is made, an inventory of existing technology for handling non-native speech (for speech recognition, assessment, error detection, etc.) is made, the feasibility of porting technology from one language to the other is investigated, pilot experiments are conducted, such as porting the present technology developed for detecting pronunciation errors in Dutch to English, and further pilot experiments are conducted, such as porting the present technology developed for detecting word stress errors in Dutch to English. Therein input is typically first determined tolerantly, and thereafter more strictly. It is noted that a further similar sub-division may be provided by the present system, and even further that the application of determination may vary throughout use of the present system, e.g. at one point being (somewhat) more tolerant, then being (somewhat) more strict, then even more strict, and then somewhat less strict. Other components mentioned above are ported as well, in a later stage.
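The varying strictness of determination over the course of use could, for instance, be realised as a simple schedule of thresholds, as in the following minimal sketch; the threshold values and names are hypothetical.

# Minimal sketch of strictness varying throughout use: tolerant first, then
# stricter, then even stricter, then somewhat less strict again.
TOLERANCE_SCHEDULE = [0.4, 0.6, 0.8, 0.7]

def determine(confidence: float, step: int) -> bool:
    """Accept the learner's input when its recogniser confidence meets the
    threshold that applies at this point in the training."""
    threshold = TOLERANCE_SCHEDULE[min(step, len(TOLERANCE_SCHEDULE) - 1)]
    return confidence >= threshold

print(determine(0.65, step=0), determine(0.65, step=2))   # True False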

In the above feasibility study an inventory was drawn up of errors that should be addressed in a training program. The selected errors are based on research data and on Radboud in'to Languages' teaching experience throughout the Netherlands. The relevance of the selected errors is regarded as dependent not only on the effect a mispronunciation can have on intelligibility, but also on a possible negative attitude a Dutch English pronunciation can evoke.

The above survey of existing technology for handling non-native speech has revealed that research on ASR for non-native speech is carried out by a limited number of academic sites worldwide. On the market there are few products that employ ASR, usually with limited functionalities and for different target groups than those addressed by the present invention. There are some products on the market that make use of speech technology; some products even purport to employ ASR, although this cannot always be ascertained. However, the present survey of tests and demos has made clear that most of these products do not really make use of ASR and certainly do not use the present advanced ASR technology for error detection that allows providing feedback on the correctness of e.g. individual sounds, prosody, etc.

The above feasibility study has shown that expertise acquired in developing classifiers and tools (software) for one language is very useful in developing similar tools for other languages, and this also speeds up the development of the present technology for those new languages, i.e. the amount of time needed will gradually diminish. A main effort therein lies in collecting and annotating the required speech recordings.

The present system makes it possible to optimize for a given language pair by focusing on errors made by learners, optimizing technology for detecting these errors, providing suitable exercises for practicing the problematic aspects, etc. The tasks language learners have to perform are, for instance, reading utterances aloud, listening to utterances produced by the present system and then repeating (producing) these utterances, and so-called shadowing (i.e. listening to utterances and repeating them while they are produced, with only a short delay). The level of difficulty of these tasks will gradually increase and adapt to the proficiency level of the student. For these tasks it is known what the students should say; however, since what they actually produce could be different, the technology is able to verify whether the learner was making a serious attempt to produce the utterance in the task or whether (s)he was trying to fool the system. To this end utterance verification algorithms are employed. In all cases mentioned above, the technology should be able to cope with English spoken with e.g. many different Dutch accents at different levels. This is a difficult and challenging task that requires dedicated technology optimized for this specific goal.
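By way of illustration only, an utterance verification step could compare the recognised attempt with the prompted utterance, as in the minimal sketch below; the thresholds, function names and example sentence are hypothetical and do not describe the actual algorithms of the application.

# Minimal sketch of utterance verification: decide whether the learner made a
# serious attempt at the prompted utterance or tried to fool the system.
def verify_utterance(recognised_words: list, prompt_words: list,
                     confidence: float, min_confidence: float = 0.5,
                     min_overlap: float = 0.6) -> bool:
    """Return True when the recognised attempt matches the prompt well enough."""
    if confidence < min_confidence:
        return False                      # the recogniser itself is not sure enough
    overlap = len(set(recognised_words) & set(prompt_words)) / max(len(prompt_words), 1)
    return overlap >= min_overlap         # enough of the target words were attempted

# Example: the learner was asked to read "this is a bad cap" aloud.
print(verify_utterance(["this", "is", "a", "bat", "cap"],
                       ["this", "is", "a", "bad", "cap"], confidence=0.8))   # True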

With the present system students can practice their pronunciation in English: they produce utterances, the system assesses their pronunciation, checks whether sounds were pronounced incorrectly, provides feedback on errors detected, and suggests appropriate exercises for improvement. Dedicated technology is developed to perform these tasks automatically through a computer program. Since the system has to cope with e.g. English spoken with a whole range of e.g. Dutch accents, this has been a challenging task requiring innovative technology, developed and optimized for this specific task. The system can be web-based, providing students the opportunity to use it anytime and anywhere they want. The present system relates to a high-end product that educational institutions teaching e.g. English can use to support teachers in providing feedback to their e.g. Dutch students on their pronunciation of the English language.
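The practice loop described above (assess the produced sounds, detect errors, give feedback, suggest exercises) could be organised along the lines of the following minimal sketch; it compares phone sequences in a deliberately simplified way, and all names, data and exercises shown are hypothetical.

# Minimal sketch of one practice round: compare produced and target phones,
# report errors, and look up follow-up exercises for the mispronounced sounds.
def practice_round(produced_phones: list, target_phones: list, exercise_bank: dict):
    feedback, exercises = [], []
    for produced, target in zip(produced_phones, target_phones):
        if produced != target:                      # strict comparison after recognition
            feedback.append(f"Expected {target}, heard {produced}")
            exercises.extend(exercise_bank.get(target, []))
    score = 1 - len(feedback) / max(len(target_phones), 1)
    return score, feedback, exercises

# Example: final devoicing in 'bad' (cf. Table 2) triggers a devoicing exercise.
score, feedback, exercises = practice_round(
    ["b", "æ", "t"], ["b", "æ", "d"],
    exercise_bank={"d": ["minimal pairs: bad/bat, seed/seat"]})
print(score, feedback, exercises)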

The present technology relates amongst others to error detection at word level, such as speech verification, segment error detection, API definition, and XML representation, and error detection at utterance level, such as speech verification, segment error detection, API definition, and XML representation.
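As the application does not specify the actual API or XML schema, the following minimal sketch merely illustrates how word-level and utterance-level error detection results could be expressed as XML; all element and attribute names are hypothetical.

# Minimal sketch of an XML representation of error detection results at
# utterance and word level, built with the Python standard library.
import xml.etree.ElementTree as ET

utterance = ET.Element("utterance", text="this is a bad cap", verified="true")
word = ET.SubElement(utterance, "word", orthography="bad")
ET.SubElement(word, "segment_error", target="/d/", realised="/t/", type="final_devoicing")

# Serialise to a single-line XML string that could be exchanged via the API.
print(ET.tostring(utterance, encoding="unicode"))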