


Title:
SPEECH-TO-TEXT CONVERSION
Document Type and Number:
WIPO Patent Application WO/2013/000868
Kind Code:
A1
Abstract:
A method of automatically transcribing speech comprising the steps of: processing a recorded portion of speech (14); and transcribing (22) the processed recording of the portion of speech to a text file (24) using a speech-to-text transcription algorithm (32, 22), the speech-to-text transcription algorithm (32, 22) utilising a pre-existing user profile (30) to render a substantially accurate text file (24), wherein the step of processing the recorded speech (14) comprises morphing the recorded speech (20) such that the processed recording resembles the same portion of speech as spoken by a standardised voice, and wherein the pre-existing user profile (30) is optimised for transcribing portions of speech spoken in the standardised voice. The method may additionally comprise recording the portions of speech to be processed.

Inventors:
LEVINE ANDREW (GB)
Application Number:
PCT/EP2012/062256
Publication Date:
January 03, 2013
Filing Date:
June 25, 2012
Assignee:
LEVINE ANDREW (GB)
International Classes:
G10L15/26; H04M3/56; G10L21/00; G10L21/013
Foreign References:
US20070143103A12007-06-21
Other References:
SARASWATHI S ET AL: "Time scale modification and vocal tract length normalization for improving the performance of Tamil speech recognition system implemented using language independent segmentation algorithm", INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, KLUWER ACADEMIC PUBLISHERS, BO, vol. 9, no. 3-4, 22 September 2007 (2007-09-22), pages 151 - 163, XP019643332, ISSN: 1572-8110
HUI YE ET AL: "High quality voice morphing", ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2004. PROCEEDINGS. (ICASSP ' 04). IEEE INTERNATIONAL CONFERENCE ON MONTREAL, QUEBEC, CANADA 17-21 MAY 2004, PISCATAWAY, NJ, USA,IEEE, PISCATAWAY, NJ, USA, vol. 1, 17 May 2004 (2004-05-17), pages 9 - 12, XP010717535, ISBN: 978-0-7803-8484-2
Attorney, Agent or Firm:
HUTCHINSON, Thomas (57 Hoghton Street, Southport, Merseyside PR9 0PG, GB)
Claims:

1. A method of automatically transcribing speech comprising the steps of: processing a recorded portion of speech; and transcribing the processed recording of the portion of speech to a text file using a speech-to-text transcription algorithm, the speech-to-text transcription algorithm utilising a pre-existing user profile to render a substantially accurate text file, wherein the step of processing the recorded speech comprises morphing the recorded speech such that the processed recording resembles the same portion of speech as spoken by a standardised voice, and wherein the pre-existing user profile is optimised for transcribing portions of speech spoken in the standardised voice.

2. A method of automatically transcribing speech as claimed in claim 1, further comprising the step of recording the portion of speech to be processed.

3. The method of claim 1 or claim 2, wherein the step of morphing comprises using a voice-shifting algorithm to map the actual recorded speech onto a standardised voice.

4. The method of any of claims 1, 2 or 3, wherein the speech of substantially any user is transcribable without that user first having to create a user-specific profile for the speech-to-text algorithm.

5. The method of any of claims 1 to 4, wherein the speech-to-text algorithm comprises a plurality of pre-existing user profiles.

6. The method of any preceding claim, wherein the voice-shifting algorithm comprises a number of pre-set parameters.

7. The method of claim 6 when dependent on claim 5, wherein the voice-shifting algorithm comprises a set of pre-set parameters corresponding to each of the pre-existing user profiles.

8. The method of claim 5, claim 6 or claim 7, wherein the speech-to-text algorithm comprises a pre-existing user profile corresponding to any one or more of the group comprising: a standardised male child profile; a standardised female child profile; a standardised male adult profile; and a standardised female adult profile.

9. The method of any preceding claim, wherein the step of recording the speech of a user comprises buffering an electronic file within a dynamic memory module or creating a real-time recording of the actual words spoken.

10. The method of claim 9, wherein the real-time recording is stored in a data file corresponding to the text file.

11. The method of claim 9 or claim 10, wherein the text file comprises embedded multimedia files.

12. The method of claim 11, wherein the multimedia files comprise sound files corresponding to snippets of speech associated with corresponding portions of text.

13. The method of any preceding claim adapted for automatically transcribing conversations between a plurality of speakers, the method additionally comprising the steps of: recording portions of speech to be transcribed; distinguishing between different speakers, associating individual speakers with the recorded portions of speech spoken by each speaker and assigning a time stamp to each portion of recorded speech; processing each portion of recorded speech individually and transcribing the processed portions of recorded speech, using a speech-to-text transcription algorithm, as separate entries in a text file, wherein the separate text file entries are arranged in chronological order according to their time stamps and are identified within the text file as having been spoken by their respective speakers.

14. A method of automatically transcribing conversations comprising the steps of: receiving portions of speech to be transcribed; distinguishing between different speakers, associating individual speakers with the portions of speech spoken by each speaker and assigning a time stamp to each portion of speech; processing each portion of speech individually and transcribing the processed portions of speech, using a speech-to-text transcription algorithm, as separate entries in a text file, the speech-to-text transcription algorithm comprising at least one pre-existing user profile, and wherein the step of processing the speech comprises morphing the speech using a voice-shifting algorithm, such that the processed recording resembles the same portion of speech as spoken by a standardised voice, and in which the pre-existing user profile is optimised for transcribing portions of speech spoken in the standardised voice, wherein the separate text file entries are arranged in chronological order according to their time stamps and are identified within the text file as having been spoken by their respective speakers.

15. A method as claimed in claim 14, further comprising the step of recording the portions of speech.

16. The method of any of claims 13 to 15, wherein the step of distinguishing between different speakers comprises any one or more of the group comprising: using cues; identifying different voice patterns and distinguishing between different speakers according to differences in voice patterns; and recording each speaker's voice in a separate channel.

17. The method of claim 16, in which the cues comprise any one or more of the group comprising: manually-inputted cues; a push-to-talk system; automatic cues; making a video recording of each speaker and identifying mouth movements that are indicative of that speaker speaking.

18. The method of claim 16, in which different voice patterns are identified by configuring an algorithm to recognise differences in pitch, tone, pace, timbre, and/or raspiness.

19. The method of claim 16, wherein recording each speaker's voice in a separate channel is accomplished by any one or more of the group comprising: by reference to a physical cable via which incoming speech is received; providing each speaker with a separate microphone; utilising a directional microphone; utilizing a spatial calculation based on frequency shift or offset in the recording.

20. The method of claim 16 or claim 19, wherein each speaker's speech is recorded as a separate sound file.

21. The method of claim 16 or claim 19, wherein each speaker's speech is recorded as a separate track of a multi-channel format sound file.

22. The method of any of claims 16 to 19, further comprising the step of filtering.

23. The method of any preceding claim, wherein the speech-to-text transcription algorithm compares the speech patterns within portions of recorded speech with entries in an embedded dictionary comprising a database of sound patterns that correspond to phonemes and words and uses the comparison to determine a subset of probable words that have been spoken.

24. The method of claim 23, further comprising narrowing-down the sub-set of probable words by comparing the sub-set of probable words with a database comprising known word patterns and grammatical rules and determining from the sub-set of probable words the most likely words spoken.

25. The method of claim 24, further comprising determining a best-match phrase or sentence, and entering the best-match phrase or sentence as a portion of text into the text file.

26. The method of any preceding claim, wherein the recording step comprises eliminating periods of silence.

27. The method of claim 26, further comprising creating separate snippets of recorded sound by chopping the recorded sound at periods of silence.

28. The method of any preceding claim, further comprising inserting a time stamp in the text file corresponding to the chronology of the recorded speech.

29. The method of any preceding claim, wherein the step of morphing comprises the steps of: conforming the recorded portion of speech and audio harmonic transformation.

30. The method of claim 29, wherein the step of conforming comprises removing from the recorded portion of speech audio signals that are outside the band of frequencies usually associated with voices and normalising the recorded portion of speech.

31. The method of claim 30, wherein audio signals substantially outside the range of ~300 Hz to ~4 kHz are removed, or outside such ranges of frequency as are best determined to isolate a clean voice, as determined by the environment, processing, equipment and the speaker's voice qualities.

32. The method of claim 30 or claim 31, wherein normalising comprises setting the maximum signal peaks to a reference level.

33. The method of any of claims 29 to 32, wherein the step of audio harmonic transformation comprises adjusting the pitch of the recorded portion of speech and spectral vocoding.

34. The method of claim 33, wherein the pitch of the recorded portion of speech is adjusted using a time shifting function to move the fundamental pitch of the voice substantially without affecting the pace of the recording.

35. The method of claim 33 or claim 34, wherein spectral vocoding comprises re-shaping the audio signal to enhance the acoustic energies in a particular frequency pattern.

36. A speech-to-text transcription apparatus for performing the method of any preceding claim comprising: at least one microphone for recording at least one speaker's voice, a data storage device for storing an electronic recording of the speaker's voice, and a CPU for executing at least one algorithm on the electronic recording.

37. A method substantially as hereinbefore described, with reference to, and as illustrated in, the accompanying drawings.

38. A speech-to-text transcription apparatus substantially as hereinbefore described, with reference to, and as illustrated in, the accompanying drawings.

Description:
SPEECH-TO-TEXT CONVERSION


This invention relates to speech-to-text conversion, and in particular, but without limitation to automated dictation and transcription methods.

Enabling a computer to accurately utilise and/or understand the spoken word has been a goal of computer science for a number of decades. In recent years, the accuracy and usability of speech-to-text software has improved considerably, meaning that off-the-shelf software, or even software embedded within computer operating systems, is now readily available which can be used to issue simple computer instructions that control a computer with voice commands, in addition to providing dictation functionality. The usefulness and complexity of the available software varies tremendously, however, and is dependent on a number of factors such as the quality of the microphone used to capture the spoken word and the processing power of the computer itself. Equally, factors related to the speech interpretive engines used have a direct bearing on the capabilities and accuracy of spoken instructions and dictation. Interpretive engines that provide basic computer instructions have a very limited and prescriptive dictionary and require the user to adhere to specific 'key words' and scripts, whereas interpretive engines for dictation are challenged with a near-infinite variability in language expression, which is completely antithetical to the rules-based interpretive engines used today. In short: reliable, repeatable and accurate transcription is achievable only by imposing strict constraints. To provide interpretive engines with broader use, especially in response to dictation, some speech-to-text software systems require a user to "train" the interpretive engine. Training an interpretive engine is accomplished by providing it with extensive samples of the user's enunciation, voice qualities, language use and other characteristics, and even then, overall accuracy and reliability would never be 100%. The "trained" system can more accurately transcribe that user's speech but, at the same time, is less able to transcribe anything other than that user's voice. This is problematic should the user's voice change due to a cold, or should the environment in which the voice is typically recorded change; in extreme cases a trained system has to be completely retrained if the user purchases a new microphone or new computer, or upgrades the software that manages the transcribing. In any event, a highly trained interpretive engine would not be able to transcribe a multi-person conversation.

In very general terms, speech-to-text software essentially works by recording or buffering speech segments using a microphone and by breaking the speech segments into smaller sub-units, e.g. phonemes. The software incorporates a built-in "dictionary" containing a list of words and a phonetic representation of each word. The software then maps the captured strings of phonemes onto words in its dictionary and, by using word associations, e.g. words that often follow other particular words, and other speech patterns, will attempt to reconstruct the phrase or sentence as text data that can be used in word processing applications, e-mail client applications, and/or as voice commands to control the computer itself. The steps of recording, splitting, mapping and checking against control patterns consume a lot of processing power and are inherently error prone.
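By way of illustration only, the following Python sketch shows the dictionary-lookup step described above; the phonetic dictionary, the captured phoneme stream and the greedy matching strategy are invented for this example and do not form part of the invention.

PHONETIC_DICTIONARY = {
    ("HH", "AH", "L", "OW"): "hello",
    ("M", "AY"): "my",
    ("N", "EY", "M"): "name",
    ("IH", "Z"): "is",
    ("B", "AA", "B"): "bob",
}

def map_phonemes_to_words(phoneme_stream):
    # Greedily match the longest known phoneme sequence at each position.
    words, i = [], 0
    max_len = max(len(key) for key in PHONETIC_DICTIONARY)
    while i < len(phoneme_stream):
        for length in range(max_len, 0, -1):
            candidate = tuple(phoneme_stream[i:i + length])
            if candidate in PHONETIC_DICTIONARY:
                words.append(PHONETIC_DICTIONARY[candidate])
                i += length
                break
        else:
            words.append("<unknown>")  # no phoneme match: the fundamental error case
            i += 1
    return words

stream = ["HH", "AH", "L", "OW", "M", "AY", "N", "EY", "M", "IH", "Z", "B", "AA", "B"]
print(map_phonemes_to_words(stream))  # ['hello', 'my', 'name', 'is', 'bob']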

Besides conventional dictation and/or the control of the computer, one particular application where speech-to-text recognition would be particularly useful is in transcribing conversations or meetings. For example, if a system were available that could automatically record what a number of people say and produce a text document that can serve as a record of that conversation, then that would be extremely useful. In addition, the ability of a computer to record and transcribe a telephone conversation between two unknown voices (users) would be particularly useful, especially from an evidential point of view, for example where, say, an insurance policy is sold to a customer based on questions asked and answered during a conversation over the telephone.

However, for the reasons outlined above, the technology to implement such a system is not yet available because the accuracy and reliability of existing speech-to-text recognition programs is insufficient to be able to provide a reliable record of a conversation unless the participants have taken the time to produce, and make available, their individual speech-to-text user profiles. In other words, existing speech-to-text programs do not function properly or reliably with new users, multiple users or "stranger voices". Further, speech-to-text systems are not designed to manage multiple voice profiles simultaneously.

Notwithstanding the above, existing speech-to-text programs encounter a great deal of difficulty when trying to distinguish between the voices of a number of simultaneous users. As such, when trying to transcribe a conversation involving more than one person, existing speech-to-text recognition programs generally fail to differentiate between the voices. Therefore, not only does the existing speech-to-text transcription technology encounter difficulty in accurately transcribing what is said, it also encounters difficulty in compiling an accurate chronological and user-separated record of what was said by the respective parties.

Errors in speech-to-text conversion can occur for a variety of reasons. The most fundamental error occurs when the user says a word that is absent from the "dictionary", in which case, no phoneme match can be made. This can be overcome by the use of phoneme-only voice recognition whereby words are effectively spelled-out using grammar rules. Beyond that, there are factors such as quality of recording, background noises, the environment where the recording was made, the processing abilities of the computer, accents, slurring, abbreviation, volume, frequency, hoarseness, pitch, tonal shift, raspiness, speed, timing and even speech-affecting physiology (lisp, fractured bilabial, deviated septum) etc., which can all have an effect on the accuracy of speech recognition.

To combat this, many speech-to-text programs incorporate "training" modules whereby a user can read stock text samples to the software so that the software can modify its internal dictionary's phoneme-word mappings. In addition, it is sometimes also possible to get the software to scan pre-existing files containing text that a user has created to search for commonly recurring phrases, and so build up a user-specific lexicon. In many cases, the process of "teaching" the software to "recognise" a user's voice occurs automatically with the software iteratively creating and modifying a user "profile". Over time, the user's profile can become more sophisticated until eventually dictation and voice recognition accuracy can be better than 90%.

Known speech-to-text software packages can be "stand-alone", in which the software is run on a single computer and the computer's users each have separate user profiles that can be imported into the software, or "distributed", whereby the software or profiles can be stored and/or executed on networked or internet-based servers or client machines. In either case, an individual's user profile must be accessed before the software can be run effectively, which places a large processing and data transfer burden on the system.

Besides the need for the computer running the software to have high-level processing capabilities (which is not always a given where a server-client, thin client, laptop or web-based implementation is used), a major disadvantage of known speech-to-text recognition software is that it relies on well-developed user profiles to obtain high accuracies. At present, it can take in excess of 200 hours of continuous use to train a speech-to-text software package to an accuracy of greater than 95%. This is an extremely time-consuming process and often results in users "giving up" on the software before they are able to realise its full benefits. Again, this investment may be completely undone should there be a change to the hardware, microphone or environment, or a change to the speaker's voice, which can result in the accuracy of the speech-to-text system dropping to a level that renders it unusable.

As such, whilst transcription of individual voices is something that has been virtually perfected in recent years, conversation is something that speech-to-text applications find very difficult. Multiple voices and speakers require advanced processing abilities, adaptive technologies and advanced logic / synthesis analysis to convert to text accurately and quickly.

In addition, certain scenarios require not only conversations to be transcribed, but chronological order and timings to be maintained. This means that there is a need for multiple speakers to have their speech converted to text and formatted in the chronological order in which the conversation took place. The challenge with this is that transcription systems discard silence and do not attribute multiple speakers to specified speech-to-text segments. When transcribed text is received back there is no notion of time within it, nor is it formatted in a manner that coherently illustrates specific speakers.

It is an object of the present invention to provide a solution to one or more of the above problems and/or to provide an improved/alternative version of speech-to-text recognition. A first aspect of the invention provides a method of automatically transcribing speech comprising the steps of: processing a recorded portion of speech; transcribing the processed recording of the portion of speech to a text file using a speech-to-text transcription algorithm, the speech-to-text transcription algorithm utilising a pre-existing user profile to render a substantially accurate text file, wherein the step of processing the recorded speech comprises morphing the recorded speech using a voice-shifting algorithm, such that the processed recording resembles the same portion of speech as spoken by a standardised voice, and wherein the pre-existing user profile is optimised for transcribing portions of speech spoken in the standardised voice.

An embodiment of the invention may additionally comprise the step of recording a portion of speech to be transcribed.

One of the advantages of an embodiment of the invention may be that it enables the speech of substantially any user to be transcribed without that user first having to create a user-specific profile for the speech-to-text algorithm since the user's actual voice is mapped onto a standardised voice for which the speech-to-text algorithm already has a well-developed user profile pre-installed.

Because different users could have vastly differing voice patterns, which could require certain voice types to be morphed excessively to match the standardised voice, the speech-to-text algorithm may comprise a plurality of pre-existing user profiles and the voice-shifting algorithm may have a corresponding number of pre-set parameters. For example, users could be categorised by sex and/or age, and could be required to specify whether they are a male child, a female child, a male adult or a female adult, and the voice-shifting algorithm's parameters could be adjusted to optimise morphing of the recorded speech onto a corresponding pre-existing user profile (i.e. a standardised male child profile, a standardised female child profile, a standardised male adult profile or a standardised female adult profile, respectively).

The step of recording the speech of a user may comprise buffering an electronic file within a dynamic memory module, but may preferably comprise creating a real-time recording of the actual words spoken. The actual recording may be stored in a data file corresponding to the text file for subsequent, manual comparison and checking purposes.

The text file created by an embodiment of the invention preferably comprises embedded multimedia files, for example, sound files corresponding to snippets of speech associated with corresponding portions of the text.

A second aspect of the invention provides a method of automatically transcribing conversations comprising the steps of: processing portions of speech to be transcribed; distinguishing between different speakers, associating individual speakers with the portions of speech spoken by each speaker and assigning a time stamp to each portion of speech; processing each portion of speech individually and transcribing the processed portions of speech, using a speech-to-text transcription algorithm, as separate entries in a text file, the speech-to-text transcription algorithm comprising at least one pre-existing user profile, and wherein the step of processing the speech comprises morphing the speech using a voice-shifting algorithm, such that the processed recording resembles the same portion of speech as spoken by a standardised voice, and in which, the pre-existing user profile is optimised for transcribing portions of speech spoken in the standardised voice, wherein the separate text file entries are arranged in chronological order according to their time stamps and are identified within the text file as having been spoken by their respective speakers.

The second aspect of the invention may additionally comprise the step of recording a portion of speech to be processed. The speech-to-text transcription algorithm of an embodiment of the invention preferably utilises a pre-existing user profile to render a substantially accurate text file. The step of processing the recorded speech may comprise morphing the recorded speech using a voice-shifting algorithm, such that the processed recording resembles the same portion of speech as spoken by a standardised voice, and in which case, the pre-existing user profile is preferably optimised for transcribing portions of speech spoken in the standardised voice.

The step of distinguishing between different speakers may be accomplished by any one or more of the group comprising: using cues; identifying different voice patterns and distinguishing between different speakers according to differences in voice patterns; and identifying, via spatial frequency shifts in the recorded file, precise spatial coordinates and the distance to the microphone for each person in the room, thereby providing an accurate 'voice vector map' for all speakers, and thereby recording each speaker's voice in a separate channel.

Cues may be manual or automatic cues. Manually-inputted cues may comprise a push-to-talk system, whereby each user is required to depress a button or otherwise indicate when they are speaking, whereas an automatic cuing system may involve making a video recording of each speaker and identifying mouth movements that are indicative of that speaker speaking.

Different voice patterns could be identified by configuring the system to search and track differences in pitch, tone, pace, timbre, raspiness, etc.

Most preferably, an embodiment of the invention records each speaker's voice in a separate channel, which could be accomplished in a variety of ways. For example, different speakers in a telephone conversation could be differentiated with reference to the telephone line that their incoming speech is received on, each speaker could be provided with his/her own microphone, or a directional microphone could be used that is capable of separating speech originating from different locations. Where multi-channel recording is used, each speaker's speech could be recorded as a separate sound file, or several speakers' speech could be recorded separately albeit within a single, multi-channel format sound file.

Where multi-channel recording is used, an embodiment of the invention preferably comprises means associated with each channel for filtering out sounds recorded on the other channels.

The speech-to-text transcription algorithm preferably works by comparing the speech patterns within portions of recorded speech with entries in an embedded dictionary comprising a database of sound patterns that correspond to phonemes and words, and using the comparison to determine a sub-set of probable words that have been spoken. Thereafter, the speech-to-text algorithm may narrow down the sub-set of probable words by: comparing the sub-set of probable words with a database comprising known word patterns and grammatical rules; determining from the sub-set of probable words the most likely words spoken; and determining a "best-match" phrase or sentence, which is then entered as a portion of text into the text file.
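Again purely by way of illustration, the following sketch shows how a sub-set of probable words might be narrowed down using word-pair statistics; the candidate lists and pair frequencies are invented for this example, and a real engine would search candidate paths far more exhaustively.

PAIR_FREQUENCY = {
    ("my", "name"): 120, ("name", "is"): 300, ("is", "bob"): 15,
    ("my", "gnome"): 1, ("gnome", "is"): 2, ("is", "bop"): 1,
}

def best_match_phrase(candidate_sets):
    # Pick, left to right, the candidate forming the most frequent pair with
    # the previously chosen word; a real engine would score whole paths.
    chosen = [candidate_sets[0][0]]
    for candidates in candidate_sets[1:]:
        chosen.append(max(candidates,
                          key=lambda word: PAIR_FREQUENCY.get((chosen[-1], word), 0)))
    return " ".join(chosen)

# Each inner list is the sub-set of probable words for one snippet of speech.
print(best_match_phrase([["my"], ["name", "gnome"], ["is"], ["bob", "bop"]]))
# -> "my name is bob"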

The recording step may comprise filtering the detected speech to eliminate periods of silence. Separate snippets of recorded sound could be created by chopping the recording during periods of silence, which are ignored anyway by the speech-to-text algorithm. The term "chopping" means separating the speech or recorded portions of speech into discrete portions or snippets to remove silences. However, because an embodiment of the invention attributes a time stamp to each speaker or snippet of speech, it is possible to represent on paper (or otherwise) the pace of a conversation or periods of contemplation, for example by inserting an "entry" and "exit" time for each portion of text. As such, an embodiment of the invention may preserve the chronology of a conversation, whereas in a known speech-to-text transcription this often vital information would be lost.

Preferred embodiments of the invention shall now be described, by way of example only, with reference to the accompanying drawings in which:

Figure 1 is a schematic representation of a system according to an embodiment of the invention for transcribing speech;

Figure 2 is a schematic representation of a system according to an embodiment of the invention for transcribing multi-speaker conversations;

Figure 3 is a schematic wave pattern observed on a single channel;

Figure 4 is the schematic wave pattern of Figure 3, but with a gate around the speech or sound or identified breaks in the portion thereof;

Figure 5 is a cue sheet of sound snippets as captured by an embodiment of the invention;

Figure 6 is a master channel/sequence cue sheet;

Figure 7 is a chronological sequence of the snippets based on the master channel/sequence cue sheet;

Figure 8 is a schematic representation of a master transcription output produced by an embodiment of the invention;

Figure 9 is a detailed schematic of an interview transcription apparatus according to an embodiment of the invention;

Figure 10 is a schematic of the audio chunking component of the apparatus of Figure 9; and

Figure 11 is a schematic of the vocal transformation component of the apparatus of Figure 9.

In Figure 1, a speech-to-text transcription system 10 comprises a microphone 12 for recording the speech 14 of a speaker 16 and a data storage device 18 for electronically recording the recorded speech as a master sound file. In most, but not all cases, the system will be implemented on one or more computers, which are not shown per se in the drawings for simplicity. Instead, each of the functional elements of such a computer, computers or network of computers is illustrated schematically.

Once recorded, the master sound file 20 can be accessed by a speech-to-text transcription program 22, which converts the speech 14 into a text file 24.

The master sound file contains a timeline enabling various portions of the recording to be time-stamped.

The speech-to-text transcription program 22 interfaces with the data storage device 18, which also stores a database 28 containing sound-word mappings to enable the program 22 to short-list probable words spoken according to sound patterns extracted from snippets of the master sound file 20 and a user profile 30 containing user-specific modifications to the sound-word mapping database 28.

A main processing algorithm 32 of the speech-to-text transcription program 22 accesses the master sound file 20, the sound-word mapping database 28 and the user profile 30 to convert the master sound file 20 into the outputted text file 24. The text file 24 includes, at intervals, the time that various passages were spoken or intervals between various passages, as determined by the time stamps embedded within the master sound file 20.

An embodiment of the invention is characterised by a morphing algorithm 34 that operates to morph the recorded speech, or the speech recorded by the microphone, into a morphed copy stored on the data storage device 18. Most preferably, the morphing step occurs between the master sound file 20 and the main processing algorithm 32 of the speech-to-text transcription program 22. The morphing algorithm 34 maps the actual recorded speech 14 onto a simulation of the same speech as spoken by a standardised user. The user profile 30 has been optimised for the simulated, standardised speech.

The morphing algorithm comprises a number of filters that can carry out a range of operations to map the actual recorded speech onto a standardised voice. The various operations include, but are not limited to: speeding up or slowing down the recorded speech to match the pace of the recorded voice to that of the standardised voice; adjusting the pitch of the recorded speech; and applying noise suppression and smoothing filters to remove raspiness and wavering and to match the tone of the recorded voice to that of the standardised voice.
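Purely as an illustrative sketch of the kinds of operations listed above (pace and pitch adjustment), and assuming the third-party librosa and soundfile Python packages are available, morphing a recording towards a standardised voice might be prototyped as follows; the target values are arbitrary and would, in practice, be derived from the selected standardised-voice profile.

import librosa
import soundfile as sf

def morph_towards_standard_voice(in_path, out_path, semitone_shift=2.0, pace_ratio=1.05):
    # Load the recorded speech at its native sampling rate.
    y, sr = librosa.load(in_path, sr=None, mono=True)
    # Adjust the pitch of the recorded speech towards the standardised voice.
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=semitone_shift)
    # Speed up or slow down the recording to match the pace of the standardised voice.
    y = librosa.effects.time_stretch(y, rate=pace_ratio)
    sf.write(out_path, y, sr)

# Example call (file names are placeholders):
# morph_towards_standard_voice("speaker.wav", "speaker_morphed.wav")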

The morphing algorithm takes a vocal recording from any speaker and, by way of pre-processing it, transforms the recording to a much narrower range of 'generalised synthetic' vocal characteristics. The pre-processing profiling of the voice allows greater certainty in transcription and more variety in the types and qualities of voices recorded. The automated processing/transcription of the synthesised voices requires less monitoring, and produces more accurate transcriptions, because the voice dynamics, qualities, harmonics and 'uniqueness to an individual' characteristics and nuances are removed. This process has two core functional areas: audio conformation and audio harmonic transformation.

At the conformation stage, on presenting a vocal digital audio file, the morphing step (usually executed using an algorithm, but this could also be achieved using analogue or digital circuitry) initially examines the raw audio information, identifies audio signals that are outside the band of frequencies usually associated with voices (i.e. outside the range of ~300 Hz to ~4 kHz) and removes these sounds. This signal then has the necessary compression applied to bring the dynamic range to reference levels. Finally, the audio is normalised, e.g. by setting the maximum signal peaks to a reference level, such as -0.6 dB, and stored. The conformed audio is then scanned using an automated scanning function, such as, for example, a Fast Fourier Transform (FFT). The FFT function ascertains the fundamental and primary frequency range(s) and generates an abstract score of the audio signal's entropy (chaotic variability, tonal quality, speed and frequency, designed to flag speech with interference). These parameters are used to suggest the base profile used for the later audio transformation stage; these base profiles correspond to the standard set of synthetic profiles, which serve as baselines. The conformation process, coupled with the FFT function, develops a file and ascertains which synthetic profile is closest to the originating voice.
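An illustrative sketch of the conformation stage follows, assuming the numpy and scipy Python packages; the filter order, the -0.6 dBFS reference level and the simple peak-picking estimate of the dominant frequency are assumptions made for this example only.

import numpy as np
from scipy.signal import butter, sosfiltfilt

def conform(audio, sample_rate, low_hz=300.0, high_hz=4000.0, peak_dbfs=-0.6):
    # Remove signals outside the band of frequencies usually associated with voices.
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sample_rate, output="sos")
    voiced = sosfiltfilt(sos, audio)
    # Normalise so that the maximum signal peak sits at the reference level.
    target_peak = 10.0 ** (peak_dbfs / 20.0)
    voiced = voiced * (target_peak / np.max(np.abs(voiced)))
    # Estimate the dominant frequency from an FFT of the conformed audio.
    spectrum = np.abs(np.fft.rfft(voiced))
    freqs = np.fft.rfftfreq(len(voiced), d=1.0 / sample_rate)
    return voiced, freqs[np.argmax(spectrum)]

# Example: a 220 Hz component is attenuated away while a 440 Hz component survives.
sr = 16000
t = np.arange(sr) / sr
conformed, dominant_hz = conform(np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 220 * t), sr)
print(round(dominant_hz))  # ~440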

The harmonic transformation stage takes the stored conformed audio, and the 'suggestion' from the FFT scan, to select from a set of parameters stored in a database. These parameters define how the conformed audio's harmonic content will be adjusted so that a voice of consistent pitch and stable harmonic content can be derived. The first step in this process uses a time shifting function to move the fundamental pitch of the voice without affecting the pacing of the recording. Then spectral vocoding (with formant shaping) is used to bring out the acoustic energies in a particular frequency pattern (and assist in shaping vowel sounds). The pattern of frequencies used will usually be variable according to the profile and tuned to bring out the best conversion to text possible.

The process outlined pre-processes the vocal audio, which in turn is stored alongside the conformed audio so that the transformation function can easily be called again to create new output audio with a different profile, or to create a new pre-processed file with a known profile.

Figure 2 is largely the same as Figure 1, except that it comprises a multi-track recording device 50 connected to a plurality of input devices (e.g. microphones, telephone wires, outputs of a multi-directional microphone etc.) 52. In this case, the master sound file 54 comprises a number of channels, which are gated and processed separately (for example, sequentially or in parallel by separate speech-to-text transcription programs 22).

To convert the incoming multichannel audio into speaker based, audio chunks that minimise acoustic interference, a process of audio 'chunking' is used, using a bespoke, multichannel noise gate array.

With a variety of recording techniques (cf. the illustrated example in which four highly directional microphones are used), the current speaker's speech signal will clearly register with one microphone over its neighbours. The gating DSP function is programmed to look for a differential signal over a particular threshold. Once triggered, the function looks forward to the point where the signal drops below the threshold for more than x milliseconds (where x represents an optimal interval, derived in sampling the voice file and/or via testing a variety of sound files). The system then places 'in' and 'out' markers a few milliseconds either side of the speech transitions.
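A rough sketch of the gating logic described above is given below; the threshold, the hold time x and the padding are illustrative assumptions only.

import numpy as np

def find_speech_markers(samples, sample_rate, threshold=0.05, hold_ms=300, pad_ms=20):
    # 'hold' is the x-millisecond interval the signal must stay below the threshold.
    hold = int(sample_rate * hold_ms / 1000)
    pad = int(sample_rate * pad_ms / 1000)
    active = np.abs(samples) > threshold
    markers, start, last_active = [], None, None
    for i, is_active in enumerate(active):
        if is_active:
            if start is None:
                start = i                      # gate opens: provisional 'in' point
            last_active = i
        elif start is not None and i - last_active > hold:
            # The level has stayed below the threshold for more than x milliseconds.
            markers.append((max(0, start - pad), min(len(samples), last_active + pad)))
            start = None
    if start is not None:                      # speech still active at the end of the file
        markers.append((max(0, start - pad), len(samples)))
    return markers                             # list of ('in', 'out') sample indices

# Example: two bursts of "speech" separated by silence on one channel.
sr = 16000
signal = np.zeros(sr * 2)
signal[1000:5000] = 0.3
signal[20000:24000] = 0.3
print(find_speech_markers(signal, sr))  # two ('in', 'out') pairs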

Once the multichannel audio has been marked up, the file is split into four mono audio files, and each of those is scanned for the 'In/Out' markers to extract new mono audio file chunks.

Each chunk is given a timer offset (based on the beginning of the recording) and a channel reference, stored either solely or in combination in the filename, the file metadata, or a custom data wrapper. No assumption is made about how the chunks overlay each other; they are simply declared as interesting units of speech content, worthy of a conversion attempt. The time index of each chunk will be all that is needed to sequence the chunks later. Thereafter, the speech 14 recorded on each channel is converted to a separate text file 54, 56, 58, 60 with various snippets of text being associated with time stamps and a speaker indicia.
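By way of example only, the extracted chunks might be labelled with their channel reference and timer offsets by encoding both in the filename, as sketched below using the ('in', 'out') markers from the previous sketch; the filename pattern is an assumption for illustration, not a prescribed format.

import soundfile as sf

def export_chunks(samples, sample_rate, markers, channel, out_dir="."):
    # 'markers' is the list of ('in', 'out') sample indices for this channel.
    chunks = []
    for start, end in markers:
        start_s, end_s = start / sample_rate, end / sample_rate
        name = f"{out_dir}/ch{channel:02d}_{start_s:09.3f}_{end_s:09.3f}.wav"
        sf.write(name, samples[start:end], sample_rate)   # one mono chunk per marker pair
        chunks.append({"channel": channel, "in": start_s, "out": end_s, "file": name})
    return chunks   # the time index of each chunk is all that is needed for later sequencing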

The separate text files are then combined in a conversation transcription editor 62, which places the snippets of text from each text file in chronological order according to their time stamp in a master transcript text file 64.

The master transcript text file 64 can additionally comprise embedded snippets of the master sound file corresponding to various portions of text, for example a hyperlink at the end of each paragraph linking that paragraph to a bookmarked, or corresponding, portion of the master sound file 20 such that a user can manually verify the transcript by playing back a portion of the master sound file 20.

In one specific embodiment of the invention, a conversation is recorded via a multichannel device and saved in the WAV file format, or the Broadcast Wave (WAV Broadcast or WAVB) file format. In particular, an advanced WAVB file inherently contains a time stamp, which feature is relatively well known. It is possible that an alternative file format is used for transport of the recording (for example AAC); WAVB is used herein as an exemplary, but not the sole, file format option available.

From the point of view of the present invention, one of the most important features of the file format (WAV in this example) is the ability to synchronize multiple segments and sound files together by a process known as timecoding.

After having recorded a conversation in a master sound file 20, some post-processing of the WAV file is required to reduce background noises, isolate individual speakers (especially if, say, one speaker's voice has been picked up by multiple microphones) and stabilise the speakers' voices to enhance speech-to-text conversion. This initial post-processing comprises the following three main enhancements:

Firstly, the file is reviewed and any silences related to when a person is not speaking are nullified. This is accomplished by using a common technology modifier called 'Noise Gating', which reduces the non-speaking sounds of a speaker to near zero. Another speaker whose voice is recorded on several microphones is 'gated' so that only his/her voice on a single microphone (channel) is kept while the others are 'noise' that is gated away. Gating works with threshold levels and can be manipulated manually or automatically. Gating is required to ensure that there is a single coherent flow of a conversation that has been recorded and that there is no 'echoed speech' from the same speaker recorded on multiple tracks / channels.

Secondly, the recorded speech needs to be sequenced so that the entire 'text' component can be assembled and formatted into a logical time-stamped sequence that accurately reflects a conversation between multiple speakers.

Thirdly, the recording needs to be stabilised. All speech-to-text systems require a voice profile to accurately transcribe voice to text. This profile often needs to be 'tuned' over many conversations and has some ability to 'learn' the characteristics of the speaker's voice. In some cases, it takes more than two hundred hours of advanced voice training to make a voice-to-text system work accurately. Timing, frequency, timbre, tone, raspiness, hoarseness, speed and resonance are some of the characteristics of a voice that need to be captured within a profile before accurate transcription is possible. An embodiment of the invention is very different to known systems in this regard inasmuch as it uses a voice modelling system that converts the actual recorded speech of the speakers into a small set of common voices. With this step, only a small number of profiles need to be used, and very accurate profiles for each 'voice type' are pre-installed in the system. As such, an embodiment of the invention shifts the "effort" of speech-to-text recognition from recognition of the speech per se to mapping the recorded speech onto a simulated, synthetic, standardised voice, which is a much simpler task and greatly increases the speed and accuracy of the translation/interpretive engine.

Once the separate text files 54, 56, 58, 60 containing the various snippets of text associated with time stamps and a speaker indicia have been created, a master transcript text file 64 must then be compiled by sequencing the separate text files.

Figure 3 is a schematic wave pattern observed on a single channel, which represents a single speaker speaking. The high amplitude areas are specific to a speaker saying something, whereas the low amplitude regions correspond to "silence". Figure 4 shows how, by setting a threshold for background noise, it is possible to determine portions of the file where the speaker is speaking. In Figure 4, the pre- and post-speech noise below the threshold is easily identified, and the pertinent speaking parts are extracted by use of a gating threshold 70.

As shown in Figure 5, the embedded time code within the file can be used to produce a cue sheet per channel (speaker). Specifically, Figure 5 shows the extractable speech segments for a single channel (single speaker). For clarity, this is designated Channel 1 (speaker 1) and each Sequence is contiguous speech. The process of identifying individual sequences is repeated for each Channel (speaker) that exists within the conversation.

Notably, it is common to have a single sound file for each speaker (WAV, WAVB, AAC), recorded at the same time on a multi-channel recorder 50. After the recording is made, each individual channel is exported separately to make a separate file 54, 56, 58, 60, i.e. if there are four speakers, there would be four synchronised files produced. Ultimately a master Channel/Sequence cue sheet, as shown in Figure 6, can be produced, enabling all of the text snippets 80 from all of the files 54, 56, 58, 60 to be interlaced/compiled in chronological order, according to their time stamps 74, as shown in Figure 7.
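For illustration, interlacing the snippets from the per-channel files into chronological order by time stamp can be sketched as follows; the snippet records are invented examples.

# Snippets already transcribed from two separate channel files (invented examples).
channel_1 = [{"speaker": 1, "in": 0.8, "out": 4.2, "text": "Good morning."},
             {"speaker": 1, "in": 12.6, "out": 15.0, "text": "Yes, that's right."}]
channel_2 = [{"speaker": 2, "in": 5.1, "out": 11.9, "text": "Morning. Shall we begin?"}]

# The master cue sheet simply orders every snippet by its time stamp.
master_cue_sheet = sorted(channel_1 + channel_2, key=lambda snippet: snippet["in"])

for s in master_cue_sheet:
    print(f'[{s["in"]:6.1f}-{s["out"]:6.1f}] Speaker {s["speaker"]}: {s["text"]}')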

Once the sequence is constructed, there are two approaches to transcribing the text into the master transcription output. A first option involves slicing, in which the channel can be divided into a number of individual pieces. To avoid any danger of cutting off portions of text, plenty of padding (silent time - 'gated' time) can be added to either side (the transcription will ignore this anyway). Each portion is then transcribed. The text received back is labelled with the channel and sequence and assembled into a single, sequenced transcribed document.

A second option involves inserting markers, which avoids the need to split the channel into individual parts. This approach could still render a sequence-able text file provided that, when the transcribed text is returned from the transcription software, there is a mechanism to identify which parts relate to which sequence. In order to ensure that this occurs, it may be possible to insert a control sound or control code, such as "new paragraph" or a special symbol like "ø¥", in between each section of speech. When the text is transcribed each portion then appears in a separate paragraph, which can then be labelled with the channel and sequence. This marker is an inaudible control character in the file that does not impact the integrity of the recording. Further, the marker acts like a 'Channel/Sequence' separator enabling reassembly into a single document. The text received back is labelled with the channel and sequence and assembled into a single, sequenced transcribed document. Once the individual pieces of text have been transcribed it is then a relatively straightforward task to reassemble them using the cue sheet into a master transcript file 64, for example, as shown in Figure 8.
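The second option can be sketched, for illustration only, as splitting the returned transcript on the separator and re-labelling each section with its channel and sequence; the separator string and cue entries below are invented examples.

SEPARATOR = "ø¥"

# Transcript returned as one block, with the separator between sections (invented example).
returned_transcript = "Good morning. ø¥ Yes, that's right."
cue_entries = [("Channel 1", "Sequence 1"), ("Channel 1", "Sequence 2")]

sections = [part.strip() for part in returned_transcript.split(SEPARATOR)]
labelled = [f"{channel} / {sequence}: {text}"
            for (channel, sequence), text in zip(cue_entries, sections)]
print("\n".join(labelled))
# Channel 1 / Sequence 1: Good morning.
# Channel 1 / Sequence 2: Yes, that's right.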

In Figure 8, it can be seen that the master transcript file 64 comprises a table containing a user identifier 72, a time stamp 74 (consisting of a start time 76 and an end time 78) and a transcript of the actual words spoken 80. The master transcript file 64 additionally comprises a hyperlink 82 or embedded object associated with each portion of the transcript 80, which hyperlink links to a separate sound file snippet corresponding to the words spoken 80 or to a bookmark 66 in the original sound recording 20.
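Purely by way of example, the master transcript table of Figure 8 might be rendered as an HTML table with one hyperlink per entry, as sketched below; the field names and link targets are illustrative assumptions.

entries = [
    {"speaker": "Speaker 1", "start": "00:00:01", "end": "00:00:04",
     "text": "Good morning.", "audio": "snippets/ch01_0001.wav"},
    {"speaker": "Speaker 2", "start": "00:00:05", "end": "00:00:12",
     "text": "Morning. Shall we begin?", "audio": "snippets/ch02_0001.wav"},
]

rows = "\n".join(
    f'<tr><td>{e["speaker"]}</td><td>{e["start"]}</td><td>{e["end"]}</td>'
    f'<td>{e["text"]}</td><td><a href="{e["audio"]}">play</a></td></tr>'
    for e in entries)
print(f"<table>\n{rows}\n</table>")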

Figures 9 to 11 are more detailed schematics of one exemplary embodiment of the invention. Identical reference signs have been used in Figures 9 to 11 to identify identical features shown in Figures 1 to 8 for ease of reference.

In Figure 9 a speech-to-text conversation transcription device comprises a digital recording device 50 connected to four directional microphones 52. The sound detected by each microphone 52 is recorded on a separate track of the digital recording device 50 and the microphones 52 can be activated/deactivated and otherwise configured using a user interface of the recording device 50, in this case, a touch screen interface 90. The recording device 50 outputs a master, multi-track sound recording 20, which comprises additional data 92 regarding the date/time of the recording and labels for each of the tracks 94.

In this example, the recording device 50 is located remotely from the main processing components of the apparatus 10. In this case, the master recording 20 is transmitted to the main processing unit 18, 22 (not shown), via a secure network connection 96. Upon receipt of the master recording 20, the main processing unit 22 prioritises different incoming jobs according to pre-set criteria using an allocation server 98 or ingest machine. Incoming master recordings 20 are either passed directly to the main processor, stored for later transcription locally or re-directed to another facility depending on the preset criteria.

An un-processed copy of the master recording 20 is archived on a file server 18 after having been indexed and other meta data added to it 100. Meanwhile, a further copy of the master file 20 is allocated a unique identifier 102 before being passed to an audio chunker 104. The operation of the audio chunker 104 shall be described in detail below, but once the master sound file has been split into chunks of speech, the individual chunks 106 are also saved for future reference on the file server 18.

The chunks 106 are then passed through a vocal transform processor 34, which morphs the chunks of speech 106 into standardised voices 110, which can be relatively straightforwardly transcribed using a speech-to-text conversion algorithm 32.

The output of the speech-to-text conversion algorithm 32 is a text file segment 54 corresponding to each chunk 106 of speech. A quality check 108 is then carried out to ensure that the transcribed text is acceptable. Most automatic speech recognition systems embed a mark-up score that relates to the accuracy of the transcribed text. If the transcription accuracy falls below a desired level, that is to say, if it fails or if the mark-up score is unacceptable, then an alternative vocal transform (morph) is used and the speech chunk 106 re-processed until a satisfactory result is achieved. Upon passing the quality check, the transcribed text files 54 are linked to their corresponding recordings in the master recording 106 facilitating a quick cross-reference between the speech and transcribed text portions. The transcribed text file portions 54 are then passed into a conversation transcription editor 62 (document assembly) and a master transcript 64 of the conversation is outputted.
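The quality-check loop might, for illustration, be sketched as follows; transform_chunk and transcribe_chunk are hypothetical placeholders standing in for the vocal transform processor 34 and the speech-to-text algorithm 32, and the score threshold is an assumption.

def transcribe_with_retries(chunk, profiles, transform_chunk, transcribe_chunk, min_score=0.9):
    # Try each standardised-voice profile until the engine's mark-up score is acceptable.
    best = None
    for profile in profiles:
        morphed = transform_chunk(chunk, profile)     # alternative vocal transform (morph)
        text, score = transcribe_chunk(morphed)       # engine returns text plus its mark-up score
        if best is None or score > best[1]:
            best = (text, score, profile)
        if score >= min_score:                        # satisfactory result: stop re-processing
            break
    return best                                       # (text, score, profile actually used)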

In Figure 10, the audio chunking processor 104 analyses the recorded tracks 94 of the master, multi-channel recording 20 and gates each track 94 to remove periods of silence or non-speech between the actual speech portions 114 of each track 94. "In" and "out" markers are then applied either side of the speech portion 114, and everything outside the markers is discarded to leave individual chunks of speech 106. The chunks 106 are then labelled according to a track identifier and their "in" and "out" times, and outputted as a batch of separate chunks 106 for later transcription and/or re-assembly in chronological order.

Finally, the vocal transform processor 34, or morphing algorithm, is shown in Figure 11 and comprises an initial conform stage 118 whereby the chunks 106 are conformed to a base acoustic profile by adjusting the dynamics and spectral EQ and by normalising the levels of the raw sound file. An example of this would be a recording of a sentence where the speaker says "Hello, my name is Bob" but, as the recording was made, there was background noise, someone else speaking, multiple microphones recording the speech at the same time etc. By adjusting the dynamics of the recording, and by adjusting the levels of the entire recorded sound spectrum, it is possible to adjust and compare the speaker against any other sounds that were recorded, thereby isolating what the speaker actually said. Such a technique compares the speaker's voice to a synthetic/standardised voice profile and can thus, with greater certainty, submit what the speaker actually said for further processing. The conformed audio file 120 is then profiled 122 to check whether it maps readily onto any of the pre-stored synthetic voice profiles 30 of the system 10. Profiling 122 uses a dedicated profile controller 124 which compares various parameters of the conformed audio file 120 with those of the pre-stored profiles 30 and identifies, where possible, a match.

Identification of matches is accomplished using a process controller 126 which monitors and tracks the chunks 106 and decides whether the conformed audio is a good enough match with a pre-stored profile, in which case it permits the transcription stage to proceed; orders another attempt at profiling/conforming if no match can be found; or abandons the matching process.

Meanwhile, assuming that a match can be found, the profile controller 124 begins morphing the conformed sound file 120 onto one of the pre-set profiles 30 in a transform process, whereby the pitch, spectral coding and levels of the conformed audio file 120 are adjusted to match those of the pre-set synthetic voice profiles 30, enabling a morphed sound file 110 to be outputted to the speech-to-text transcriber 32 as previously described.

The invention is not restricted to the details of the foregoing embodiments, which are merely exemplary of the invention. For example, the data storage device may comprise a hard disk drive, a RAM module, or a server- or web-based data storage facility. In addition, the master transcript file need not necessarily be a single entity as it could exist in a virtual format as a database of links to portions of text, data and other files/documents located elsewhere.