

Title:
CONFIRMATION METHOD THROUGH SPEECH SYNTHESIS IN AUTOMATIC DICTATION SYSTEMS AND A SYSTEM FOR THE APPLICATION OF THIS METHOD
Document Type and Number:
WIPO Patent Application WO/2012/011885
Kind Code:
A1
Abstract:
The invention relates to a method for creating a confirmation mechanism in automatic dictation systems by using speech synthesis (Text-to-Speech, TTS) and a segmentation feature in addition to speech recognition (Speech Recognition, SR). The invention relates to a system equipped with at least one speech recognition (SR) module (23) converting the words of the user (21) to text by automatically recognising them, a microphone (22) providing input to this module, at least one monitor (24) on which the text can be displayed and edited, belonging to at least one device wherein the dictation system operates, at least one speech synthesis (TTS) module (25) converting the text created as a result of automatic speech recognition to audio output, a segmentation module (27) dividing the output created through speech synthesis into its parts when necessary, and a headphone (26) transmitting these outputs to the user (21).

Inventors:
ARSLAN MUSTAFA LEVENT (TR)
Application Number:
PCT/TR2011/000175
Publication Date:
January 26, 2012
Filing Date:
July 22, 2011
Assignee:
SESTEK SES VE ILETISIM BILGISAYAR TEKNOLOJILERI SAN TIC A S (TR)
ARSLAN MUSTAFA LEVENT (TR)
International Classes:
G10L13/04; G10L15/26; G10L15/22
Foreign References:
US6490563B2 (2002-12-03)
US20050038652A1 (2005-02-17)
EP0389514A1 (1990-10-03)
US20090187406A1 (2009-07-23)
US7480613B2 (2009-01-20)
Attorney, Agent or Firm:
DESTEK PATENT, INC. (Zindankapi Sk.No.10 Osmangazi, Bursa, TR)
Claims:
CLAIMS

1. The invention is a dictation system comprising:

at least one speech recognition (SR) module (23) converting the words of the user (21) to text by automatically recognising them, a microphone (22) providing input to this module by transmitting the user's (21) voice, at least one device wherein the dictation system operates and at least one monitor (24), belonging to this device, wherein the text can be displayed and edited,

at least one speech synthesis (TTS) module (25) converting the text created as a result of automatic speech recognition to audio output, and

a headphone (26) transmitting these outputs to the user (21);

characterised in that, while the automatic speech recognition process is in progress, the recognition result is read out through speech synthesis (TTS) during dictation pauses and is listened to by the user (21) through the headphone (26) for control purposes.

2. A system according to Claim 1, characterised in that what is said is transmitted through a microphone (22) to at least one speech recognition (SR) module (23), is converted to text therein and is displayed on the monitor (24) of at least one device wherein the dictation system is established, and this text is voiced by the speech synthesis (TTS) module (25) while the dictating process is in progress, synchronously with the display of the text on the monitor (24).

3. A system according to Claim 2, characterised in that the recognition result that has just been converted to text is listened to by the user (21) through the speech synthesis (TTS) module (25) and a headphone (26) during the rests and pauses the user (21) gives between the phrases he utters when speaking; thus, since the user (21) can listen to the phrases he has just dictated right away, errors can be noticed more easily.

4. A dictation system comprising:

at least one speech recognition (SR) module (23) converting the words of the user (21) to text by automatically recognising them,

a microphone (22) providing input to this module by transmitting the user's (21) voice, at least one device wherein the dictation system operates and at least one monitor (24), belonging to this device, wherein the text can be displayed and edited,

at least one speech synthesis (TTS) module (25) converting the text created as a result of automatic speech recognition to audio output,

a segmentation module (27) dividing the output created through speech synthesis into its fragments, and

a headphone (26) transmitting these outputs to the user (21);

characterised in that, after the whole text is created following completion of the automatic speech recognition process (after dictation), the recognition result is read out through speech synthesis (TTS) and is listened to by the user (21) through a headphone (26) in logical parts for control purposes.

5. A system according to Claim 4, characterised in that what is said is transmitted to at least one speech recognition (SR) module (23) through a microphone, is converted to text therein and is displayed on the monitor (24) of at least one device; and, after the dictating process is completed, the user (21) who will be checking the text can control the completed whole text by reading it from the monitor (24) in a manner similar to the one mentioned above and/or by listening to the output of the speech synthesis (TTS) module (25) through a headphone.

6. A system according to Claim 5, characterised in that the process of listening is facilitated by properly dividing the audio recordings created through speech synthesis (TTS) into their parts by the segmentation module (27), and the user (21), who then listens to the segmented parts instead of listening to the whole text at once, can notice errors more comfortably.

7. A system according to any one of the Claims above, characterised in that synthesis is performed much faster than normal speed and thus long pauses need not be given during dictation.

8. A system according to any one of the Claims above, characterised in that, by using a headset (26) instead of a loudspeaker, the dictating process is prevented from being affected adversely.

Description:
DESCRIPTION

CONFIRMATION METHOD THROUGH SPEECH SYNTHESIS IN AUTOMATIC DICTATION SYSTEMS AND A SYSTEM FOR THE APPLICATION OF THIS METHOD

The Related Art

The invention relates to a method for creating a confirmation mechanism by using speech synthesis (Text-to-Speech, TTS) on top of speech recognition (Speech Recognition, SR) technology and, when necessary, through segmentation of the recordings.

The invention relates, in particular, to a dictation system based on obtaining confirmation by having users listen, through speech synthesis (TTS), to the automatically recognised text during the dictating process, and thus receiving auditory feedback on the correctness of the recognised words or word groups without having to look at the monitor or rely solely on the speech recognition (SR).

The Prior Art

In many areas of business and everyday life (for example, taking minutes in meetings or preparing medical reports), written documents need to be created and stored; and in a great many cases where this requirement arises, these records are obtained by transferring verbal expressions to text in various ways.

For many years, this process has been performed either by putting the words uttered during speech to text simultaneously (dictating) or by transcribing sound recordings made on an audio recorder afterwards. Both methods, each with its own pros and cons, require extremely intensive labour and time.

In the recent past, in the light of developments in audio technology, systems that automatically convert speech to text started to be developed. In these systems, which are based on speech recognition technology, what is said by the user speaking into a microphone is perceived by a speech recognition (SR) module (generally operating in a computer environment) and is converted to text. However, in automatic speech recognition systems, recognition accuracy is never 100%. Therefore, during dictation, the user has to check whether what he is saying is being recognised correctly or not.

In an example of existing systems shown in Figure 1, the audio input received through the microphone (12) is converted to text by the speech recognition module (13). The user (11) follows this text, created as a result of automatic speech recognition, on the monitor (14) and can thus check whether what he is saying is recognised correctly or not. The user's reading of what is written both slows down the dictation process (or lengthens the control process after dictating) and makes it impossible for the user to focus on other work while he is speaking, since it takes all of the user's attention.

Another error that the said speech recognition systems fall into is the recognition of a word very similar to the one being said, instead of the word actually uttered. In such a case, the user may not notice these minor mistakes when reading the text from the monitor. In texts where the correctness of the dictation result is of crucial importance (for example in the medical field), these minor mistakes can create serious problems.

Users also benefit in some cases from the spell-check feature of text editing (text editor) applications to correct the text produced after dictation. However, this feature does not support some languages (e.g. Turkish, Asiatic languages, etc.) with full efficiency, and this can lead to recognition errors being missed by a user relying on the language support. In cases where the support is available, on the other hand, an error in which the automatic recognition module makes the wrong choice between two acoustically similar words that are both in the dictionary may pose a problem, since it will not be perceived by this feature either.

The problems mentioned have also been noticed in the studies in the literature, and the subject has been approached from various perspectives. In one of these approaches, control of the dictated text is done not only by reading it from the monitor but also by listening to it. In the prior art, in the patent document numbered US7480613, a method is disclosed in which the user's verbal expressions are also recorded as audio during dictation and are played back afterwards at different speeds for confirmation purposes. For recognition results found reliable above a reliability level determined by statistical methods, the relevant parts of the speaker's original audio recordings are played back rapidly; for those found below that level, they are played back slowly. As the reliability level increases or decreases, the playback speed of the relevant passages also increases or decreases. Thus, based on this reliability score, the user can have a tentative idea of the recognition accuracy in different parts of the text.

In the patent document numbered US6490563, which addresses a similar subject, a dictation system is disclosed consisting of a speech recognition (SR) module that receives the audio signal input from the user through a microphone and converts the speech to text, and a speech synthesis (TTS) module that converts the created text back to an audio signal output with the help of a loudspeaker (Claim 23). It is stated that the text parts to be voiced through speech synthesis are selected with indicators defined in various forms.

Due to the shortcomings of the prior art and the approaches described above, a novelty in automatic dictation systems has been sought.

Brief Description of the Invention

On this basis, the purpose of the invention is to disclose a system that confirms whether recognition is made accurately or not in a very rapid and easy manner, by adding speech synthesis (TTS) and segmentation modules to the automatic dictation system in addition to speech recognition (SR). The purpose of the invention is that, while automatic speech recognition is in progress, the recognition result is read out through speech synthesis (TTS) and can be listened to by the user concurrently, in between dictation or after dictation.

Another purpose of the invention is that the user can attend to other tasks during or after dictation without having to follow on the monitor the text created by the speech recognition (SR) module from the audio signal coming from the microphone; for example, a doctor dictating in the medical field can meanwhile examine his patient's x-ray results. In this way, auditory feedback on the result of automatic speech recognition is obtained in cases where one does not wish, or is not able, to look at the monitor of the unit to which dictation is made, or two-way (both visual and auditory) feedback is obtained in cases where one can look at the monitor; with the help of these feedbacks, recognition errors can be easily noticed and eliminated.

A further purpose of the invention, if a spell-check feature contained in modern text editing applications is being used when speech is converted to text, is to provide quick detection of recognition errors in cases where this feature remains inadequate or itself causes errors. A further purpose of the invention is that, thanks to the use of speech synthesis (TTS), any recognition result that might be obtained when large-vocabulary recognition is performed can be read out to the user.

A further purpose of the invention is to enable the user to listen to the speech synthesis (TTS) output for confirmation purposes through headphones instead of loudspeakers. The aims are thus both to eliminate ambient noise and to prevent the speech synthesis output from affecting the dictating process negatively, since it would interfere with the speech recognition input if a loudspeaker were used.

A further purpose of the invention is to make it possible, by reading out the recognition result in the intervals between speech segments and by adjusting the speech synthesis (TTS) for confirmation/control purposes, for the dictating process to be realised both with 100% accuracy and in a very short time.

In a preferred embodiment of the invention, it is aimed that the auditory feedback can be used not only during dictation but also after dictation for control purposes. In this application, it is aimed to divide (segment) the audio recordings created through text-to-speech synthesis into fragments so that they can be followed more comfortably and controlled more easily.

The present invention, which is mentioned above and will be detailed below, brings many conveniences by virtue of its properties. With the said system, a system is disclosed that provides user satisfaction by dramatically reducing the time and workforce requirements of the user's dictation experience, with highly accurate feedback and confirmation methods.

In evaluating the characteristics and the advantages provided by the invention, taking into account the figures below and the detailed descriptions written with reference to these figures will be useful for understanding the invention more clearly.

Description of the Figures

Figure 1 is a schematic view of an application belonging to the existing system.

Figure 2 is a schematic view of the system which is the subject of the invention.

Reference Numbers

11. User

12. Microphone

13. Speech Recognition (SR) module

14. Monitor

21. User

22. Microphone

23. Speech Recognition (SR) module

24. Monitor

25. Speech synthesis (TTS) module

26. Headphone

27. Segmentation module

Detailed Description of the Invention

In this detailed description, the preferred embodiments of the confirmation system through speech synthesis in dictation systems based on automatic speech recognition, which is the subject of the invention, are disclosed only for a better understanding of the subject and have no restrictive effect. The invention relates to a system and a method that provide a confirmation mechanism by having the automatically recognised text listened to by users through speech synthesis (TTS) in dictation systems based on speech recognition (SR).

In Figure 2, a schematic view of the system which is the subject of the invention is illustrated. The said system consists of the following main components: at least one speech recognition (SR) module (23) converting the words of at least one user (21) to text by automatically recognising them, a microphone (22) providing input to this module by transmitting the user's (21) voice, at least one device wherein the dictation system operates and at least one monitor (24) belonging to this device, wherein the text can be displayed and edited, at least one speech synthesis (TTS) module (25) converting the text created as a result of automatic recognition to audio output, a segmentation module (27) dividing the output created through speech synthesis into its fragments when necessary, and a headphone (26) through which these outputs can be listened to by the user (21).

In applications of the system which is the subject of the invention, what is uttered by the user (21) is transmitted in the form of audio signals to at least one speech recognition (SR) module (23) through a microphone (22). There, the audio signals, which are made meaningful through sound processing methods, are converted to text and can be viewed on the monitor (24) of at least one device wherein the dictation system is established.
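For illustration only, this data flow can be sketched as follows; the callables and their signatures are assumptions introduced for the sketch, since the patent does not prescribe concrete recognition or synthesis engines.

```python
from typing import Callable


def dictation_step(audio_chunk: bytes,
                   recognize: Callable[[bytes], str],      # SR module (23)
                   synthesize: Callable[[str], bytes],     # TTS module (25)
                   display: Callable[[str], None],         # monitor (24)
                   play: Callable[[bytes], None]) -> str:  # headphone (26)
    """Audio from the microphone (22) is recognised, shown on the monitor
    and voiced back to the user (21) for confirmation."""
    text = recognize(audio_chunk)   # convert the utterance to text
    display(text)                   # visual feedback on the monitor (24)
    play(synthesize(text))          # auditory feedback through the headphone (26)
    return text
```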

The said speech synthesis (TTS) module (25) voices the created text while the dictating process is in progress. The system automatically detects the rests and pauses the user (21) gives between the phrases he utters when speaking, and causes the recognition result that has just been converted to text to be listened to by the user (21) during these pauses.

For example, when the user (21) dictating a sentence pauses before the next sentence, he can listen to how the sentence he has just uttered was put to text by the speech recognition module (23). Thus, the user (21) does not waste time selecting the parts he wishes to listen to in the text created as a result of automatic speech recognition; he can notice errors better since he can listen to the phrase he has just dictated.
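A minimal sketch of this in-dictation confirmation loop follows, assuming 16-bit mono PCM frames from the microphone; the energy threshold, frame length and the recognize/synthesize/play helpers are illustrative assumptions, not part of the patent text.

```python
import array
import math
from typing import Callable, Iterable

PAUSE_RMS_THRESHOLD = 500.0   # assumed silence threshold for 16-bit PCM
PAUSE_MIN_FRAMES = 25         # ~0.5 s of silence at 20 ms frames marks a rest
TTS_RATE_FACTOR = 1.5         # read back faster than normal speed (see below)


def frame_rms(frame: bytes) -> float:
    """Root-mean-square energy of one 16-bit signed PCM frame."""
    samples = array.array("h", frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))


def confirmation_loop(frames: Iterable[bytes],
                      recognize: Callable[[bytes], str],
                      synthesize: Callable[[str, float], bytes],
                      play: Callable[[bytes], None]) -> None:
    """When a sufficiently long rest is detected, the phrase that was just
    dictated is recognised (23), synthesised (25) at an increased rate and
    played to the headphone (26) during the pause."""
    pending, silent_frames = [], 0
    for frame in frames:
        if frame_rms(frame) < PAUSE_RMS_THRESHOLD:
            silent_frames += 1
        else:
            silent_frames = 0
            pending.append(frame)
        if silent_frames >= PAUSE_MIN_FRAMES and pending:
            text = recognize(b"".join(pending))
            play(synthesize(text, TTS_RATE_FACTOR))
            pending, silent_frames = [], 0
```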

Since the said speech synthesis (TTS) module (25) synthesises the text created through speech recognition much more rapidly than at normal speed, giving long pauses during dictation is not required.

In the said system, the output of the automatic speech recognition (SR) module (23) provides two-way feedback, since it is both displayed on the monitor (24) and listened to by the user (21) through a headphone (26) after being converted to audio in the speech synthesis (TTS) module (25). While it performs the function of an extra confirmation mechanism in cases where the monitor (24) can be followed, it obviously provides an absolute means of control in cases where the monitor (24) cannot be followed. If the automatic speech recognition has occurred without a problem, the user (21) can continue with the dictating process rapidly. In the event that there is an error in recognition, the user (21) can make the necessary correction through a voice instruction or manually, since he will be able to notice it easily.
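The patent does not specify the form of the voice instruction; purely as an illustration, a correction command of the assumed form "correct X to Y" could be applied to the recognised text as follows.

```python
import re

# Assumed command grammar for illustration only; not defined by the patent.
CORRECTION_PATTERN = re.compile(r"^correct (?P<wrong>.+?) to (?P<right>.+)$",
                                re.IGNORECASE)


def apply_voice_instruction(instruction: str, text: str) -> str:
    """Return the dictated text with the requested correction applied."""
    match = CORRECTION_PATTERN.match(instruction.strip())
    if not match:
        return text  # not a correction command; leave the text unchanged
    return text.replace(match.group("wrong"), match.group("right"), 1)

# Example: apply_voice_instruction("correct hyptension to hypertension", report)
```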

In the said system, by using a headphone (26) instead of a loudspeaker, contrary to the existing patent documents, the recordings being played back are prevented from being perceived as recognition input by the speech recognition (SR) module (23) while the dictating process is in progress, and the user (21) is prevented from being affected by ambient noise while listening.

In other preferred applications of the system which is the subject of the invention, control of the dictated text can also be done at the end of the dictating process. In this case, the speech synthesis (TTS) module (25) begins voicing after the whole text has been created. After the dictating process, the user (21), or another user (21) who will check the text, can control the whole text again by reading it from the monitor (24) in a manner similar to the one described above and/or by listening to the output of the speech synthesis (TTS) module (25) through the headphones. In this type of application, the segmentation module (27) steps in to facilitate the user's (21) task.

The function of the said segmentation module (27) is to facilitate the listening process by dividing the audio recordings created through speech synthesis into their parts, and to prevent the user's (21) attention from being distracted and thus the control process from being affected adversely. The user (21), who then listens to the segmented parts instead of listening to the whole text in one go, will notice errors more comfortably.
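A minimal sketch of this post-dictation review follows, assuming a simple sentence-level segmentation; a real segmentation module might also use pause or punctuation cues, and the synthesize/play helpers are assumptions.

```python
import re
from typing import Callable, Iterator, List


def segment_text(text: str) -> List[str]:
    """Naive sentence-level segmentation on ., ? and ! boundaries."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def review_segments(text: str,
                    synthesize: Callable[[str], bytes],
                    play: Callable[[bytes], None]) -> Iterator[str]:
    """Play each logical part through the headphone (26) and yield it so the
    caller can pause, repeat or correct before moving on to the next part."""
    for segment in segment_text(text):
        play(synthesize(segment))   # TTS module (25) -> headphone (26)
        yield segment
```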

In the said system, the microphone (22) and the headphone (26) components can also be used as a combined microphone-headphone (headset), rather than separately. The scope of protection of this application is determined in the Claims section and can never be limited to what is described above for exemplary purposes. It is clear that a person skilled in the art could implement the novelty brought forward by the invention and/or could apply this embodiment to other areas of similar purpose used in the relevant technique; therefore, it is apparent that such embodiments would lack novelty and, in particular, would not satisfy the criterion of exceeding the state of the art.