ROBINSON, Charles Q. (100 Potrero Avenue, San Francisco, California, 94103-4813, US)
TSINGOS, Nicolas R. (100 Potrero Avenue, San Francisco, California, 94103-4813, US)
CLAIMS
1. A concatenative synthesis method, including the steps of: (a) analyzing speech in a first language to generate first data indicative of a sequence of snippets of the speech in the first language; and (b) generating scrambled speech data indicative of speech in a second language using concatenative synthesis, including by concatenating grains indicative of snippets of the speech in the second language in response to the first data.
2. The method of claim 1, wherein steps (a) and (b) are performed in real time in response to input speech data indicative of the speech in the first language.
3. The method of claim 1, wherein the second language is an unintelligible, synthetic language.
4. The method of claim 1, wherein step (a) is performed by analyzing input speech data indicative of the speech in the first language, and said speech in the first language is uttered by a participant in a game in which at least one additional participant also participates, said method also including the step of: selectively providing one but not both of the input speech data and the scrambled speech data to the additional participant.
5. The method of claim 1, wherein step (a) is performed by analyzing input speech data indicative of the speech in the first language, and said speech in the first language is uttered by a participant in an online chat room in which at least one additional participant also participates, said method also including the step of: selectively providing one but not both of the input speech data and the scrambled speech data to the additional participant.
6. The method of claim 1, wherein step (b) includes steps of: retrieving grains, indicative of snippets of speech in the second language, from a stored dictionary in response to the first data; and concatenating the grains to generate the scrambled speech data.
7. The method of claim 6, wherein step (a) is performed by analyzing input speech data indicative of the speech in the first language, steps (a) and (b) are performed by a speech synthesis system, and the speech in the first language is input speech of a participant in one of a chat room and a game, said method also including the step of: asserting the input speech data from a client device to the speech synthesis system.
8. The method of claim 6, wherein step (a) is performed by analyzing input speech data indicative of the speech in the first language, steps (a) and (b) are performed by a speech synthesis system, the speech in the first language is input speech of a participant in one of a chat room and a game, and said method also includes the step of: asserting the input speech data, with metadata indicative of at least one characteristic of the input speech, to the speech synthesis system.
9. The method of claim 6, wherein the dictionary is an audio data file indicative of at least one exemplary utterance in the second language, and the grains are indicative of segments of the at least one exemplary utterance in the second language.
10. The method of claim 6, wherein the dictionary is an audio data file indicative of at least one exemplary utterance in the second language and at least one pitch-shifted version of the exemplary utterance in the second language, some of the grains are indicative of segments of the exemplary utterance in the second language, and other ones of the grains are indicative of segments of the pitch-shifted version of the exemplary utterance in the second language.
11. The method of claim 6, also including the step of: generating the dictionary, including by segmenting each said exemplary utterance into individually retrievable ones of the grains.
12. The method of claim 1, wherein step (b) includes a step of generating the scrambled speech data using concatenative synthesis, in response to metadata indicative of at least one characteristic of the speech in the first language and in response to the first data.
13. The method of claim 12, wherein step (a) is performed by analyzing input speech data indicative of the speech in the first language, and the metadata are indicative of at least one of timbre of the speech in the first language, pitch of the speech in the first language, a user-specified timbre for the speech in the second language, and a user-specified pitch for the speech in the second language.
14. The method of claim 1, wherein step (b) includes a step of performing coded-domain synthesis, and step (b) includes steps of: retrieving pre-encoded grains, indicative of snippets of speech in the second language, from a stored dictionary in response to the first data; and concatenating the pre-encoded grains to generate the scrambled speech data.
15. The method of claim 14, wherein the pre-encoded grains have been pre-encoded including by compressing grains indicative of the snippets of speech in the second language.
16. The method of claim 1, wherein steps (a) and (b) are performed by an audio server, step (a) is performed by analyzing input speech data indicative of the speech in the first language, and said method also includes the step of: asserting the input speech data from a client device to the audio server.
17. The method of claim 16, wherein said speech in the first language is uttered by a user of the client device, the user of the client device is a participant in one of a game and an online chat room in which at least one additional participant also participates, the additional participant is a user of a second client device, and said method also includes the step of: selectively asserting one of the input speech data and the scrambled speech data from the audio server to the second client device.
18. A concatenative synthesis method, including the steps of: (a) analyzing input speech data indicative of speech in a first language to generate grain identifiers indicative of dictionary locations of scrambled speech grains indicative of snippets of speech in a second language which correspond to a sequence of snippets of the speech in the first language; and (b) generating scrambled speech data indicative of speech in the second language using concatenative synthesis, including by retrieving the scrambled speech grains indicative of snippets of speech in the second language, from locations of a stored dictionary in response to the grain identifiers, and concatenating the scrambled speech grains retrieved from the stored dictionary to generate the scrambled speech data.
19. The method of claim 18, wherein the second language is an unintelligible, synthetic language.
20. The method of claim 18, wherein step (a) is performed in a first device, and step (b) is performed in a second device, said method also including steps of: asserting the input speech data and the grain identifiers to an audio server; and forwarding the input speech data and the grain identifiers from the audio server to the second device.
21. The method of claim 20, wherein the first device is a client device coupled to the audio server and the second device is another client device coupled to the audio server.
22. The method of claim 20, wherein said speech in the first language is uttered by a user of the first device, the user of the first device is a participant in one of a game and an online chat room in which at least one additional participant also participates, and the additional participant is a user of the second device.
23. The method of claim 20, wherein the audio server includes a scrambling matrix including ports, the ports include a first port to which the first device is coupled, a second port to which the second device is coupled, and at least one other port, and the scrambling matrix is configured to forward the input speech data and the grain identifiers from the first port to any selected subset of a number of different subsets of the ports.
24. The method of claim 18, wherein step (a) includes a step of generating metadata indicative of at least one characteristic of said speech in the first language, and step (b) includes a step of generating the scrambled speech data using concatenative synthesis in response to the metadata and the grain identifiers.
25. The method of claim 18, wherein the stored dictionary is an audio data file indicative of at least one exemplary utterance in the second language, and the scrambled speech grains are indicative of segments of the at least one exemplary utterance in the second language.
26. A system for concatenative synthesis, including: a memory in which a dictionary is stored, where the dictionary includes grains indicative of speech in a second language; and a speech synthesis subsystem, coupled to the memory and configured to analyze speech in a first language to generate first data indicative of a sequence of snippets of the speech in the first language, and to generate scrambled speech data indicative of speech in the second language using concatenative synthesis, including by retrieving from the memory and concatenating grains indicative of snippets of the speech in the second language in response to the first data.
27. The system of claim 26, wherein the second language is an unintelligible, synthetic language.
28. The system of claim 26, wherein the speech synthesis subsystem is configured to analyze input speech data indicative of the speech in the first language and received from a client device, said speech in the first language is uttered by a user of a client device, the user of the client device is a participant in one of a game and a chat room in which at least one additional participant also participates, and the additional participant is a user of a second client device, wherein the system also includes: a switching subsystem including ports, said ports including a first port to which the client device is coupled and a second port to which the second client device is coupled, wherein the switching subsystem is configured to selectively assert one but not both of the input speech data and the scrambled speech data to the second port.
29. The system of claim 26, wherein the speech synthesis subsystem is configured to analyze input speech data indicative of the speech in the first language to generate the first data, and wherein the system includes: an audio server in which the speech synthesis subsystem is implemented; and a client device coupled to the audio server and configured to assert the input speech data to said audio server.
30. The system of claim 29, wherein said speech in the first language is uttered by a user of the client device, and the user of the client device is a participant in one of a game and a chat room.
31. The system of claim 26, wherein the speech synthesis subsystem is configured to analyze input speech data indicative of the speech in the first language to generate the first data, said system including: an audio server in which the speech synthesis subsystem is implemented; and a client device coupled to the audio server and configured to assert the input speech data, with metadata indicative of at least one characteristic of the speech in the first language, to said audio server, and wherein the speech synthesis subsystem is configured to generate the scrambled speech data in response to the first data and the metadata.
32. The system of claim 31, wherein said speech in the first language is uttered by a user of the client device, and the user of the client device is a participant in one of a game and a chat room.
33. The system of claim 26, wherein the dictionary is an audio data file indicative of at least one exemplary utterance in the second language, and the grains are indicative of segments of the at least one exemplary utterance in the second language.
34. The system of claim 26, wherein the dictionary is an audio data file indicative of at least one exemplary utterance in the second language and at least one pitch-shifted version of the exemplary utterance in the second language, some of the grains are indicative of segments of the exemplary utterance in the second language, and other ones of the grains are indicative of segments of the pitch-shifted version of the exemplary utterance in the second language.
35. The system of claim 26, wherein the speech synthesis subsystem is configured to generate the scrambled speech data in response to the first data and metadata indicative of at least one characteristic of the speech in the first language.
36. The system of claim 35, wherein the metadata are indicative of at least one of timbre of the speech in the first language, pitch of the speech in the first language, a user-specified timbre for the speech in the second language, and a user-specified pitch for the speech in the second language.
37. The system of claim 26, wherein the speech synthesis subsystem is configured to generate the scrambled speech data by performing coded-domain synthesis, including by: retrieving pre-encoded grains, indicative of snippets of speech in the second language, from the memory in response to the first data; and concatenating the pre-encoded grains to generate the scrambled speech data.
38. The system of claim 37, wherein the pre-encoded grains have been pre-encoded including by compressing grains indicative of the snippets of speech in the second language.
39. The system of claim 26, wherein said system is a data processing system including a processor configured to implement the speech synthesis subsystem, and wherein the processor is coupled to the memory.
40. The system of claim 26, wherein said system is a data processing system including a general purpose processor programmed to implement the speech synthesis subsystem, and wherein the general purpose processor is coupled to the memory.
41. A system for concatenative synthesis, including: a memory in which a dictionary is stored, where the dictionary includes scrambled speech grains indicative of speech in a second language; and a speech synthesis subsystem, coupled to the memory and configured to analyze input speech data indicative of speech in a first language to generate grain identifiers indicative of locations in the dictionary of scrambled speech grains indicative of snippets of speech in the second language which correspond to a sequence of snippets of the speech in the first language, and to generate scrambled speech data indicative of speech in the second language using concatenative synthesis, including by retrieving the scrambled speech grains indicative of snippets of speech in the second language, from the memory in response to the grain identifiers, and concatenating the scrambled speech grains retrieved from the memory to generate the scrambled speech data.
42. The system of claim 41, wherein the second language is an unintelligible, synthetic language.
43. The system of claim 41, including a first device in which the speech synthesis subsystem is implemented, and also including: a second memory in which the dictionary is also stored; and a second device including a second speech synthesis subsystem, coupled to the second memory and configured to generate scrambled speech data indicative of speech in the second language using concatenative synthesis, including by retrieving the scrambled speech grains indicative of snippets of speech in the second language, from the second memory in response to the grain identifiers, and concatenating the scrambled speech grains retrieved from the second memory to generate the scrambled speech data.
44. The system of claim 43, also including: an audio server coupled to the first device and to the second device, and configured to forward the input speech data and the grain identifiers from the first device to the second device.
45. The system of claim 44, wherein the first device is a client device coupled to the audio server and the second device is another client device coupled to the audio server.
46. The system of claim 44, wherein said speech in the first language is uttered by a user of the first device, the user of the first device is a participant in one of a game and an online chat room in which at least one additional participant also participates, and the additional participant is a user of the second device.
47. The system of claim 44, wherein the audio server includes a scrambling matrix including ports, the ports include a first port to which the first device is coupled, a second port to which the second device is coupled, and at least one other port, and the scrambling matrix is configured to forward the input speech data and the grain identifiers from the first port to any selected subset of a number of different subsets of the ports.
48. The system of claim 41, wherein the speech synthesis subsystem is configured to generate metadata indicative of at least one characteristic of said speech in the first language, and to generate the scrambled speech data using concatenative synthesis in response to the metadata and the grain identifiers.
49. The system of claim 41, wherein the dictionary is an audio data file indicative of at least one exemplary utterance in the second language, and the scrambled speech grains are indicative of segments of the at least one exemplary utterance in the second language.
50. The system of claim 41, wherein said system is a data processing system including a processor configured to implement the speech synthesis subsystem, and wherein the processor is coupled to the memory.
51. The system of claim 41, wherein said system is a data processing system including a general purpose processor programmed to implement the speech synthesis subsystem, and wherein the general purpose processor is coupled to the memory.
CONCATENATIVE SYNTHESIS
Cross Reference to Related Applications
This application claims priority to United States Provisional Patent Application No. 61/333,340 filed May 11, 2010, hereby incorporated by reference in its entirety.
Field of the Invention
The invention relates to systems and methods for scrambling speech using concatenative synthesis. Typical embodiments are systems and methods for scrambling speech data (indicative of speech in a first language uttered by a participant in a game or online chat room) using concatenative synthesis to generate scrambled speech data indicative of speech in a second language.
Background of the Invention
Throughout this disclosure, including in the claims, the term "speech" is used in a broad sense to denote audio signals perceived as a form of communication by a human being, a beast or an object (e.g., an object or character in a virtual world, said object or character being capable of uttering speech). Thus, "speech" determined or indicated by an audio signal may be audio content of the signal that is perceived as speech (e.g., dialog, monologue, or singing) upon reproduction of the signal by a loudspeaker (or other sound-emitting transducer). Herein, speech "in a language" denotes speech perceived as being in a language, whether the language is intelligible or unintelligible to the listener.
Throughout this disclosure, including in the claims, the expression "speech data" denotes data indicative of speech, and the term "scrambling" speech data (or speech) denotes generating speech data indicative of speech in a second language (typically an unintelligible language) in response to speech in a first language (e.g., in response to speech data indicative of the speech in the first language).
Throughout this disclosure, including in the claims, the expression "snippet" of sound assumes that the sound (which may be speech) has a first duration, and denotes a segment of the sound having a second duration less than the first duration. For example, if the sound has a waveform of a first duration, a snippet of the sound has a waveform whose duration is shorter than the first duration.
Throughout this disclosure, including in the claims, the expression "grain" of audio data (e.g., speech data) assumes that the audio data is indicative of sound (e.g., speech) having a first duration, and denotes a subset of the audio data indicative of sound having a second duration less than the first duration. For example, if sound (indicated by audio data) has a waveform of a first duration, a grain of the audio data is indicative of sound having a waveform whose duration is shorter than the first duration.
Throughout this disclosure, including in the claims, the expression performing an operation "on" signals or data (e.g., filtering, scaling, or transforming the signals or data) is used in a broad sense to denote performing the operation directly on the signals or data, or on processed versions of the signals or data (e.g., on versions of the signals that have undergone preliminary filtering prior to performance of the operation thereon).
Throughout this disclosure, including in the claims, the expression "system" is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X - M inputs are received from an external source) may also be referred to as a decoder system.
Typical embodiments of the present invention are useful to scramble speech uttered by participants in a game or on-line chat room.
The language of non-playing characters (NPCs) and other AI constructs in simulation games and some role-playing games is often made up of nonsense sounds strung together like actual words. It is not a cipher, e.g., normal speech spoken backwards, but quite simply gibberish. An example of such an alien, unintelligible language is the iconic Simlish language developed for the successful franchise "The Sims."
Language synthesis became popular in cartridge and floppy-based releases before fully voiced CD-ROM releases began to appear. It can be seen as a compromise between the expression provided by voice acting and the enormous amount of data storage required by voice acting. It can also reduce the required budget for implementing a game while still improving the audience's immersion.
Some embodiments of the present invention apply language synthesis in a new context in which it had not previously been applied, i.e., for scrambling of a voice channel between a pair of game players (or chat room participants) so that they cannot understand each other (unless at least one of them employs an unscrambling system capable of unscrambling the scrambled speech that he or she receives). Typically, scrambling by concatenative speech synthesis in accordance with embodiments of the invention is not reversible, so that it is not possible (or not practical) to implement an unscrambling system. In an exemplary embodiment of the invention, two game players impersonating non-human creatures (e.g., an elf and an orc) would both speak English (and send speech data indicative of the English speech to a server), and the server would scramble the speech and send the resulting scrambled speech data back to the players to cause each player to hear appropriate unintelligible (e.g., orcish and elvish) renderings of the input English voice streams. For example, each player would receive scrambled speech data from the server which, when reproduced by a loudspeaker or other sound-emitting transducer, would be heard as the scrambled, unintelligible renderings of the input voice streams.
Herein, the expressions "input voice stream," "input speech stream," and "input audio stream" are used synonymously in a broad manner to denote speech (e.g., speech uttered by a user of an embodiment of the inventive system) which may be scrambled in accordance with an embodiment of the invention, whether the speech is "live" or uttered in real time (i.e., not pre-recorded or significantly delayed), or pre-recorded, or delayed.
Concatenative synthesis approaches have been widely used in computer music or for the generation of sound textures. Conventional concatenative synthesis approaches aim at generating a meaningful macroscopic waveform structure from a large number of shorter waveforms. They typically use databases of grains (indicative of sound snippets) to assemble target phrases. In signal processing for music, music signals are typically decomposed into a small number of grains that retain some of the most salient features of the original audio data, and the grains are included in a database to be retrieved to perform concatenative music synthesis. Graphical interfaces have been implemented for real-time concatenative granular music synthesis in a manner allowing the user to model the sound grains and their distribution. It has been proposed to drive music synthesis based on a real-time audio-stream used as a source of timbral control.
Concatenative synthesis approaches have also been used in conjunction with a real-time speech input, e.g., for speech-driven control of musical instruments. Many conventional text-to-speech algorithms also implement a form of concatenative synthesis.
However, most concatenative synthesis algorithms use small grains (typically about 20 msec long) and time-domain processing, such as pitch-synchronous overlap-add, to generate their output. In the context of the present invention, the inventors have recognized that a simpler grain scheduling algorithm must be used, since it is typically impractical to perform any post-processing on pre-encoded grains that are retrieved and concatenated to generate speech in accordance with the invention.
There is a need for a method and system for interactive speech synthesis capable of generating synthesized speech that is perceived as unintelligible utterances in a second language in response to input speech uttered by a user in a first language, preferably with the synthesized speech activity (e.g., rhythm) matching as closely as possible the voice activity of the user who utters the input speech. A goal of typical embodiments of the present invention is to generate an unintelligible, scrambled version (e.g., gibberish "translation") of speech uttered by a user in a way that is compelling and efficiently implemented in game and chat room applications.
BRIEF DESCRIPTION OF THE INVENTION
In a class of embodiments, the inventive method includes steps of: (a) analyzing speech in a first language to generate first data indicative of a sequence of snippets of the speech; and (b) generating speech data (sometimes referred to herein as "scrambled" or "translated" speech data) indicative of speech in a second language (sometimes referred to herein as "scrambled" or "translated" speech) using concatenative synthesis in response to the first data. In preferred embodiments, steps (a) and (b) are performed in real time in response to real-time input speech (e.g., speech data indicative of speech in the first language asserted to a microphone). The second language is typically a synthetic language, unintelligible to the intended listener (e.g., an "alien" language not spoken in the real world by humans), and is preferably a natural sounding, non-repetitive, unintelligible synthetic language. An important application for the inventive method (and systems configured to implement the method) is to selectively scramble voices of participants in an online chat room or game so that they can either understand each other (in which case they hear the originally uttered speech) or not (in which case they hear the scrambled speech generated in accordance with the synthesizing step).
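Step (a) can be illustrated with a minimal sketch. The disclosure does not prescribe a particular analysis, so the short-time energy test, the function name `find_snippets`, the frame length, and the threshold below are all assumptions chosen for illustration. The "first data" here is simply a list of (start frame, frame count) pairs, one per run of active speech.

```python
from typing import List, Tuple


def find_snippets(samples: List[float], frame_len: int = 160,
                  threshold: float = 0.01) -> List[Tuple[int, int]]:
    """Return (start_frame, frame_count) for each run of active frames.

    A frame is "active" when its mean squared amplitude exceeds
    `threshold` (an assumed, illustrative voice-activity test).
    """
    snippets: List[Tuple[int, int]] = []
    run_start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        active = energy > threshold
        if active and run_start is None:
            run_start = i                      # a new snippet begins
        elif not active and run_start is not None:
            snippets.append((run_start, i - run_start))
            run_start = None                   # the snippet has ended
    if run_start is not None:                  # input ended mid-snippet
        snippets.append((run_start, n_frames - run_start))
    return snippets
```

Each pair could then drive step (b), for instance by requesting grains whose total duration roughly matches the snippet, so that the scrambled speech follows the rhythm of the input.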
In typical embodiments in this class, step (b) includes the step of dynamically concatenating grains (indicative of audio snippets) that are read or otherwise selected from (i.e., "output" from) a dictionary in response to the first data. Typically, the first data are indicative of input speech from a chat room or game participant, and the first data are asserted (with or without metadata) to a speech synthesis system which retrieves the grains from the dictionary in response to the first data (and optionally also metadata) and concatenates the grains to generate the scrambled speech data. The dictionary can be an audio data file (typically a large audio data file) indicative of at least one exemplary utterance in the second language, and the grains are indicative of segments of the at least one exemplary utterance in the second language. Each exemplary utterance may be a word or sentence (or a longer utterance) in the second language. To generate the dictionary, each exemplary utterance is segmented into individually retrievable grains. In accordance with the invention, a sequence of grains is retrievable from the dictionary, in response to a corresponding input utterance (or sequence of input utterances) and optionally also "input metadata" (e.g., input metadata generated from or otherwise corresponding to the input utterance(s)). The input utterance (or sequence of input utterances) is typically an utterance by a user in the user's native language. The dictionary of independently retrievable grains (and the segmentation of "second language" utterances into the grains) is usually authored in a preliminary (pre-synthesis) step, for example, using traditional digital audio editing tools or other specific tools. Optionally, "output metadata" for each of all or some of the grains are also stored in the dictionary, such that a quantity (a set) of output metadata is retrievable from the dictionary with the corresponding grain. The output metadata are typically authored with the rest of the dictionary (e.g., in response to user-defined criteria). Sets of output metadata (which may comprise one or more phonemes/visemes, and/or one or more speech state, pitch, and/or intensity values), together with each grain corresponding to each of the sets, that are output from the dictionary in response to an input utterance determine the scrambled speech generated in response to the input utterance.
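The dictionary authoring and retrieval described above can be sketched as follows. This is an illustrative sketch only: the `Grain` type, the function names, the representation of an utterance as a list of samples, and the use of hand-authored boundary indices are assumptions, not details taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Grain:
    samples: List[float]          # audio of one segment of the utterance
    metadata: Dict[str, float]    # assumed "output metadata", e.g. pitch


def build_dictionary(utterance: List[float], boundaries: List[int],
                     metadata: List[Dict[str, float]]) -> List[Grain]:
    """Segment an exemplary utterance into individually retrievable grains.

    `boundaries` holds the authored cut points (sample indices); one
    metadata set is expected per resulting grain.
    """
    edges = [0] + list(boundaries) + [len(utterance)]
    if len(metadata) != len(edges) - 1:
        raise ValueError("one metadata set is required per grain")
    return [Grain(utterance[edges[k]:edges[k + 1]], metadata[k])
            for k in range(len(edges) - 1)]


def synthesize(dictionary: List[Grain], grain_ids: List[int]) -> List[float]:
    """Concatenate the grains named by `grain_ids`, in order."""
    out: List[float] = []
    for gid in grain_ids:
        out.extend(dictionary[gid].samples)
    return out
```

In this sketch a grain identifier is just a list index, so a sequence such as `[2, 0, 2]` selects the third grain, then the first, then the third again.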
Typically, real-time selection of each grain to be output from the dictionary is performed in response both to input speech and input metadata, where the input metadata may be generated from (or may otherwise correspond to) the input speech, rather than in response to the input speech alone. The input metadata (e.g., pitch values) can be extracted in real-time from the input speech (e.g., in an analyzer in a client device), typically using user-defined rules. For example, the input metadata may indicate the timbre and/or pitch of the input speech (or a user-specified timbre and/or pitch for the output scrambled speech), and the scrambled speech synthesized in response to the input speech and input metadata may have its timbre and/or pitch determined by the input metadata (e.g., the scrambled speech's timbre may be matched to or otherwise determined by that of the input speech).
Preferably, synthesized speech generated in accordance with the invention is scrambled speech that is perceived by typical listeners as being in an alien (unintelligible) language but that matches the tone, pitch and timbre of the input speech uttered by a user. For instance, in response to input speech from a female game player (or a male player impersonating a female character through the use of a voice font), preferred embodiments of the invention generate scrambled speech that would be perceived by typical listeners as utterances by a matching female in an alien language. Herein, the expression "voice font" denotes a predefined set of parameters and associated signal processing techniques (typically, one selected from a number of such sets that are user-selectable) aimed at modifying the timbre and/or pitch of input speech while preserving its intelligibility (e.g., modifying input speech uttered by a person of a first gender so that the modified speech is apparently uttered by a person of different gender). The synthesized (alien language) speech preferably matches as closely as possible the voice activity of the user who utters the input speech to convey accurately the level of the original voice activity, and to avoid overloading the network over which it is transmitted.
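One way metadata-driven selection might work is sketched below. The nearest-pitch rule, the function name `select_grain`, and the option of skipping the previously chosen grain (to reduce audible repetition) are assumptions made for illustration; the disclosure leaves the selection rules to the user.

```python
from typing import List, Optional


def select_grain(grain_pitches: List[float], input_pitch: float,
                 previous: Optional[int] = None) -> int:
    """Return the index of the grain whose stored pitch is closest to
    `input_pitch`, avoiding `previous` when another grain is available."""
    if not grain_pitches:
        raise ValueError("the dictionary holds no grains")
    candidates = [i for i in range(len(grain_pitches)) if i != previous]
    if not candidates:                 # only one grain exists; reuse it
        candidates = [0]
    return min(candidates, key=lambda i: abs(grain_pitches[i] - input_pitch))
```

Here `grain_pitches` stands in for the pitch values stored as output metadata in the dictionary, and `input_pitch` for a pitch value extracted from the input speech.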
Also preferably, to ensure that scrambled speech in a variety of different target languages can be generated (e.g., to match the target game world), it is desirable that speech synthesis implemented in accordance with the invention can be controlled by a game sound designer with no more than minimal setup effort.
Some embodiments of the inventive system are configured to perform coded-domain synthesis in which grains of scrambled speech are pre-encoded (e.g., compressed using a fixed frame codec), included in a dictionary in pre-encoded form, and then read out sequentially from the dictionary to generate scrambled speech consisting of a sequence of the pre-encoded grains. For example, during authoring of the dictionary, the grains may be pre-encoded in a manner which compresses them to generate encoded (compressed) grains. The encoded grains (which are fixed-length frames of data in some implementations) are then stored in the dictionary. Coded-domain synthesis using such encoded grains can be simple concatenation of the encoded grains (e.g., fixed-length frames) as they are read from the dictionary in response to input speech. By implementing coded-domain synthesis, the synthesis of scrambled speech can be efficiently implemented by an audio mixing server (without any decoding or encoding other than what is inherent in reading pre-encoded grains from a dictionary in response to input speech and asserting a sequence of the grains so read from the dictionary), relying on a remote client device to perform interactive capture of the input speech and optionally also "input metadata" extraction from captured input speech.
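By way of illustration only, coded-domain synthesis as described above may be sketched as a byte-level concatenation of pre-encoded grains. The frame size, the toy dictionary contents, and the function name `concat_encoded_grains` are assumptions made for this sketch and are not taken from the disclosure.

```python
from typing import Dict, Iterable

# Hypothetical fixed frame size (in bytes) produced by a fixed frame codec.
FRAME_BYTES = 4

def concat_encoded_grains(dictionary: Dict[int, bytes],
                          grain_ids: Iterable[int]) -> bytes:
    """Concatenate pre-encoded grains without decoding or re-encoding them."""
    out = bytearray()
    for grain_id in grain_ids:
        grain = dictionary[grain_id]
        # Assumption: every stored grain is a whole number of fixed-length frames.
        if len(grain) % FRAME_BYTES != 0:
            raise ValueError(f"grain {grain_id} is not frame-aligned")
        out += grain
    return bytes(out)

# Toy dictionary: the byte values stand in for compressed audio frames.
ENCODED_DICTIONARY = {
    0: b"\x00\x01\x02\x03",
    1: b"\x10\x11\x12\x13\x14\x15\x16\x17",
    2: b"\x20\x21\x22\x23",
}

stream = concat_encoded_grains(ENCODED_DICTIONARY, [2, 0, 1])
```

Because the grains are already frame-aligned, the concatenated stream remains a valid sequence of frames and can be forwarded as if it were an ordinary compressed voice stream.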
Some embodiments of the inventive system and method employ an audio server, preferably implemented in a simple, inexpensive manner, that is configured to forward speech data (e.g., compressed packets of speech data) from one or more client device(s) to one or more other client device(s), without decoding or otherwise processing the speech data. In some cases, such a server is configured to mix uncompressed voice streams from several client devices (including by decompressing the voice streams prior to mixing if they arrive in compressed form) and assert the mixed voice stream to client devices. A simply implemented server of this type can easily be configured to perform coded-domain synthesis in accordance with the invention. If the server must mix unscrambled (but compressed) speech streams from client devices with scrambled (but compressed) speech that it has generated using coded-domain synthesis, the server can be implemented to decode (decompress) the scrambled speech and the unscrambled speech streams (and then mix the resulting decompressed audio) at substantially the same cost as would be required to implement it to decompress unscrambled voice streams from the client devices and mix the decompressed voice streams.
In preferred embodiments in which an audio mixing server is configured to perform synthesis of scrambled speech in accordance with the invention, the server has input ports and output ports (for coupling via one or more networks to client devices) and is configured to perform selective switching of each speech stream received at one of the input ports from a client device, and each scrambled speech stream generated in the server, to selected ones of the output ports. For example, in response to receiving a speech stream (S1) at one input port and a speech stream (S3) at another input port, the server may generate a scrambled version (S2) of the S1 stream and a scrambled version (S4) of the S3 stream, and selectively pass either the S1 and S3 streams, or the S2 and S3 streams, or the S1 and S4 streams, or the S2 and S4 streams to each output port. In some embodiments, a switching matrix in the audio server is controlled in response to game control data asserted thereto (e.g., by users). For example, a first user can send to the audio server (from a client device) a control code (e.g., a code indicative of money "earned" in a game) which causes the server to send back to the first user an unscrambled version of a speech stream from a second user, whereas the first user would receive only a scrambled version of the stream from the second user if the first user had not sent the control code to the server. In a variation on this example, the first game player can send to a game server (from a client device) a control code (e.g., a code indicative of money "earned" in the game), causing the game server to respond by causing the audio server to send back to the first player an unscrambled version of a speech stream from a second game player, whereas the first player would receive only a scrambled version of the stream from the second player if the first player had not sent the control code to the game server.
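The pair-wise switching behavior described above may be sketched as follows. This is a minimal illustration under stated assumptions: the class name `SwitchingMatrix`, its methods, and the representation of a control code as an explicit (listener, talker) unlock are hypothetical and are not specified by the disclosure.

```python
from typing import Dict, Iterable, Set, Tuple

class SwitchingMatrix:
    """Toy model of a switching matrix with pair-wise scrambling control."""

    def __init__(self) -> None:
        # (listener, talker) pairs for which the listener hears unscrambled speech.
        self._unlocked: Set[Tuple[str, str]] = set()

    def apply_control_code(self, listener: str, talker: str) -> None:
        """Record that `listener` has sent a control code unlocking `talker`."""
        self._unlocked.add((listener, talker))

    def route(self, talker: str, original: bytes, scrambled: bytes,
              listeners: Iterable[str]) -> Dict[str, bytes]:
        """Select, per listener, either the original or the scrambled stream."""
        return {
            listener: original if (listener, talker) in self._unlocked else scrambled
            for listener in listeners
            if listener != talker  # a talker does not receive its own stream
        }

matrix = SwitchingMatrix()
matrix.apply_control_code(listener="C1", talker="C2")
routed = matrix.route("C2", original=b"S3", scrambled=b"S4",
                      listeners=["C1", "C2", "C3"])
```

In this sketch, device C1 receives the unscrambled stream because it sent the control code, while device C3 receives only the scrambled version.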
Typical scrambling methods employed to generate scrambled speech in accordance with the invention are not reversible, since they are not 1-to-1 mappings of input speech snippets to output grains. For example, some such scrambling methods employ a limited amount of random selection of output grains. In cases in which a reversible scrambling method is employed to concatenate grains retrieved from a scrambling dictionary to generate scrambled speech in response to input speech in accordance with the invention, a recipient of the scrambled speech could employ an unscrambling dictionary to recover the input speech. For example, a client device which receives scrambled speech from an audio server (which had generated the scrambled speech in a reversible manner using a dictionary including grains of the scrambled speech) could employ an unscrambling dictionary (including unscrambled speech grains indicative of unscrambled speech snippets, that are readable in response to scrambled speech grains, so that the unscrambled speech grains read from the dictionary can be concatenated to generate intelligible speech) to recover the input speech in response to the grains of the scrambled speech.
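The reversible case may be illustrated with a toy 1-to-1 mapping. In this sketch, snippets and grains are represented by short text labels rather than audio, and the dictionary contents and function names are assumptions made purely for illustration.

```python
from typing import Dict, List, Sequence

# Toy scrambling dictionary: each input snippet label maps to exactly one grain label.
SCRAMBLING_DICTIONARY: Dict[str, str] = {
    "he": "zo", "llo": "kra", "wor": "mi", "ld": "tuu",
}

# The unscrambling dictionary is the inverse mapping. It exists only because
# the forward mapping is 1-to-1 (no two snippets share an output grain).
UNSCRAMBLING_DICTIONARY: Dict[str, str] = {
    grain: snippet for snippet, grain in SCRAMBLING_DICTIONARY.items()
}
assert len(UNSCRAMBLING_DICTIONARY) == len(SCRAMBLING_DICTIONARY)

def scramble(snippets: Sequence[str]) -> List[str]:
    """Map each input snippet to its scrambled grain."""
    return [SCRAMBLING_DICTIONARY[s] for s in snippets]

def unscramble(grains: Sequence[str]) -> List[str]:
    """Recover the input snippets from a sequence of scrambled grains."""
    return [UNSCRAMBLING_DICTIONARY[g] for g in grains]

recovered = unscramble(scramble(["he", "llo", "wor", "ld"]))
```

If any random selection of output grains were introduced, the inverse dictionary could no longer be relied upon to recover the input, which is why the typical scrambling methods are not reversible.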
In some embodiments, the grains that comprise the dictionary (employed for scrambled speech synthesis) are not compressed or otherwise pre-encoded. Scrambled speech synthesis using such non-encoded grains can include compression of the grains that are read from the dictionary in response to input speech, to generate compressed, scrambled speech (e.g., the compressed, scrambled speech may comprise fixed-length frames which are indicative of the compressed grains). Alternatively, the scrambled speech synthesis can include other processing of grains that are read from the dictionary in response to input speech.
Scrambled speech synthesis in accordance with preferred embodiments of the inventive method and system has all or some of the following features: the ability to perform pair-wise voice scrambling in a multi-participant situation (e.g., so that at least one of the participants receives a scrambled version of input speech uttered by a participant, but another one of the participants receives the original, unscrambled input speech); the ability to perform concatenative synthesis directly in a coded domain to avoid a decode (or recode) step; the ability to perform concatenative synthesis in a server in response to input speech (or input speech data with input metadata) received from each of one or more client devices; the ability to control the timbre of output (scrambled) speech as a function of the timbre of the voice producing the input speech (e.g. to match the timbres of the input and output speech); and the ability to perform synthesis and mixing of scrambled and unscrambled voice streams in a distributed chat server.
Exemplary applications of typical embodiments of the invention for synthesizing scrambled speech in response to input speech uttered by game or chat room participants include:
selective assertion of scrambled speech, or the unscrambled input speech from which it is synthesized, to game (or chat room) participants. In some cases, assertion of unscrambled (rather than scrambled) speech to each participant is enabled or disabled by specific virtual items which can be monetized in a game;
guild or team-based scrambling (e.g., all users who do not belong to a team receive scrambled speech that has been synthesized in response to input speech uttered by a member of the team, but each member of the team receives the unscrambled input speech rather than the scrambled speech);
use of speech scrambling with voice fonts to disguise a participant's voice; and
ensuring auditory consistency between single player and multiplayer game modes by re-using audio assets from AI characters (characters generated using artificial intelligence techniques) or non-playing characters that appear in a game's single player mode to scramble the voices of competing players in multiplayer mode (e.g., utterances of an AI character in the single player mode are segmented to generate a dictionary of grains of synthetic speech in an unintelligible language, and the dictionary is used in the multiplayer mode to synthesize scrambled speech in response to player utterances).
Other embodiments of the invention implement any of various types of concatenative synthesis (e.g., client/server based concatenative synthesis) to generate synthesized sound in response to control signals or commands (rather than in response to input speech). For example, the synthesized sound can be a non-repetitive ambient soundscape/sound texture (e.g., to be played during a game). Some such embodiments dynamically stream a synthesized audio scene to a mobile client device with limited memory resources. Also, while it is contemplated that typical embodiments will be designed to synthesize unintelligible speech, such embodiments can easily be modified (or variations on such embodiments can easily be produced) to synthesize intelligible sentences in response to specific control data sequences (e.g., by providing such a control data sequence from a client device to a server to cause the server to perform the synthesis). Each such control data sequence can indicate a desired list of grains, or can cause the synthesis to be driven by a game server which is aware of the global state of a game. In some embodiments, the synthesis is performed by an audio server. For example, the latter embodiments may be useful for streaming a "live" synthesized speech input (e.g., a live sports commentary for a sports game) to a client device with limited memory or physical-media streaming capabilities (e.g., to a mobile console or phone).
In accordance with some embodiments of the inventive method, a voice chat system (e.g., an in-game voice chat system) performs real-time, speech-driven, language synthesis. For example, the system may automatically replace a user's input speech with a matching alien language (scrambled) output. It is desirable that this language scrambling feature can be controlled on a pair-wise basis by pairs of users, who may belong to a common team or guild (e.g., so that members of the same team or guild can understand each other but other users cannot understand these team or guild members). This approach can also help to achieve a consistent experience in both single-player and multi-player modes for games supporting both modes. Also, speech synthesis (scrambling) in accordance with the invention can be an element of additional voice-related items (e.g., translators) or game-play elements (development of language skills) that could be monetized in games.
Some embodiments of the invention are implemented by a voice server, to enable the server to synthesize speech (e.g., speech in unintelligible alien languages) efficiently and in a manner that is fully controllable by a game sound designer.
In another class of embodiments, scrambled speech synthesis is performed in accordance with the invention by an audio server and at least one client device coupled (typically by at least one network) to the server, where the server is configured to assert (e.g., stream) to each client device (or to each of selected client devices) grain identifiers indicative of grains of scrambled speech, but not to assert the grains themselves to any client device. In typical ones of such embodiments, a dictionary of grains for use in concatenative synthesis of scrambled speech (e.g., scrambled speech in an unintelligible alien language) is stored in memory in (or local to) each client device, rather than in memory in (or local to) the audio server. Each client device is configured to analyze input speech, to select an appropriate grain identifier indicative of a corresponding grain of scrambled speech stored in the dictionary (e.g., a grain identifier indicative of a stored grain of speech in an alien language) for each of a sequence of analyzed snippets of the input speech, and to assert both the input speech stream and the selected grain identifiers (and typically also metadata associated with the grain identifiers) to the audio server. The grain identifiers and associated metadata typically function as a very low bitrate encoding of scrambled speech to be generated in response to the input speech.
The audio server includes a subsystem (referred to herein as a "scrambling matrix") which includes the audio server's input ports and output ports. Each of these ports corresponds to one of the client devices in communication with the audio server, and the scrambling matrix is configured to route both the input speech and the grain identifiers (and metadata) from each client device from the relevant input port to all (or an appropriate subset) of the output ports.
Since grains that comprise scrambled speech are not themselves available to the audio server, the audio server does not mix any scrambled speech stream with other speech streams (either scrambled or unscrambled streams). The audio server in this class of embodiments functions as a forwarding bridge, not a mixing bridge. Since the grain identifiers and associated metadata can typically be sent at a very low bitrate (typically less than 1 Kbit/sec.) compared to the bitrate of a typical voice stream, a simple (and inexpensive) implementation of the audio server can typically assert identifiers (optionally including metadata) for all scrambled speech streams independently to each destination client device. For instance, in a short time window, a client device might receive from the server one or several packets of compressed intelligible (unscrambled) speech data as well as one or several packets of encoded scrambled (unintelligible language) speech data, each of the packets of encoded scrambled speech data comprising a grain identifier (and optionally also associated metadata) indicative of a grain of scrambled speech.
In response to reception of a scrambled speech packet containing a grain identifier and associated metadata (but not scrambled speech), each client device is configured to synthesize the corresponding scrambled speech audio signal using the dictionary resident in its memory. The synthesis typically includes a step of reading a corresponding sequence of grains of the scrambled speech from the dictionary in response to a sequence of grain identifiers and associated metadata indicated by a sequence of packets (or decompressing a sequence of packets of grain identifiers and associated metadata, in cases in which the packets are received in compressed form, and then reading a corresponding sequence of grains of the scrambled speech from the dictionary in response to the decompressed grain identifiers and associated metadata). In contrast with server-side implementations of speech scrambling, it is possible (in client-side implementations of speech scrambling) for a client device to perform additional processing (e.g., pitch correction) at the time of synthesizing scrambled speech since the data identifying each scrambled speech stream are never mixed with other streams received by the client, at the expense of a requirement for more processing power for each client device. This class of embodiments would typically require more processing and memory on the client side than embodiments that include server-side implementations of speech scrambling.
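Client-side synthesis from grain identifiers may be sketched as follows. The packet layout (`GrainPacket`), the use of a simple gain value as the associated metadata, and the representation of grains as lists of samples are all assumptions introduced for this illustration; the gain step merely stands in for the kind of additional per-stream processing (e.g., pitch correction) mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

@dataclass(frozen=True)
class GrainPacket:
    """Hypothetical scrambled-speech packet: an identifier plus metadata."""
    grain_id: int
    gain: float = 1.0  # hypothetical metadata field applied at synthesis time

def synthesize(packets: Sequence[GrainPacket],
               local_dictionary: Dict[int, List[float]]) -> List[float]:
    """Read each identified grain from the local dictionary and concatenate."""
    samples: List[float] = []
    for packet in packets:
        grain = local_dictionary[packet.grain_id]
        # Per-stream processing is possible here because the identified grains
        # have not been mixed with any other stream.
        samples.extend(packet.gain * s for s in grain)
    return samples

# Toy dictionary resident in the client device's memory.
LOCAL_DICTIONARY = {7: [0.5, -0.5], 9: [1.0]}

audio = synthesize([GrainPacket(9, gain=0.5), GrainPacket(7)], LOCAL_DICTIONARY)
```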
In variations on this class of embodiments of the inventive system, each client device is configured to indicate to the audio server which dictionary (or dictionaries) it has locally, and the audio server includes memory in which all the dictionaries are stored. Each dictionary to be used is in (or available locally to) the audio server, and is optionally also in (or available locally to) one or more of the client devices. The audio server is also configured to:
send (e.g., forward from a client device, or generate internally and then assert) grain identifiers and optionally also associated metadata, to each client device that has a local dictionary useful for synthesizing scrambled speech in response to the grain identifiers and any associated metadata; and
send (i.e., generate internally and then assert) scrambled speech (sequences of grains read from a dictionary local to the audio server) to each client that does not have a local dictionary useful for synthesizing the scrambled speech in response to grain identifiers (or grain identifiers and associated metadata).
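The two server behaviors listed above may be sketched as a single selection step. The function name, the string tags `"grain_ids"` and `"audio"`, and the dictionary name `"alien"` are hypothetical and are used only to make the sketch concrete.

```python
from typing import Dict, List, Set, Tuple, Union

Payload = Tuple[str, Union[List[int], bytes]]

def payload_for_client(client_dictionaries: Set[str],
                       dictionary_name: str,
                       grain_ids: List[int],
                       server_dictionaries: Dict[str, Dict[int, bytes]]) -> Payload:
    """Choose what the audio server sends to one destination client."""
    if dictionary_name in client_dictionaries:
        # The client has the dictionary locally: send identifiers only.
        return ("grain_ids", list(grain_ids))
    # Otherwise synthesize on the server from its own copy of the dictionary.
    grains = server_dictionaries[dictionary_name]
    return ("audio", b"".join(grains[g] for g in grain_ids))

# Toy server-side dictionary store; byte strings stand in for encoded grains.
SERVER_DICTIONARIES = {"alien": {1: b"ab", 2: b"cd"}}

with_dictionary = payload_for_client({"alien"}, "alien", [2, 1], SERVER_DICTIONARIES)
without_dictionary = payload_for_client(set(), "alien", [2, 1], SERVER_DICTIONARIES)
```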
In typical embodiments, the inventive system is or includes a general or special purpose processor programmed with software (or firmware) and/or otherwise configured to perform an embodiment of the inventive method. In some embodiments, the inventive system is or includes a general purpose processor, coupled to receive input data indicative of input speech (or grain identifiers) and optionally also metadata, and programmed (with appropriate software) to generate (by performing an embodiment of the inventive method) output data indicative of scrambled speech (or grain identifiers) and optionally also metadata. In other embodiments, the inventive system is implemented by appropriately configuring (e.g., by programming) a configurable audio digital signal processor (DSP).
Aspects of the invention include a system configured (e.g., programmed) to perform any embodiment of the inventive method, and a computer readable medium (e.g., a disc) which stores code for implementing any embodiment of the inventive method.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of an embodiment of the inventive system.
FIG. 2 is an exemplary waveform of concatenated grains, where the grains are of a type stored in the dictionary implemented in a class of embodiments of the inventive system. FIG. 2 shows the boundaries between the grains, and four of the grains are identified with labels (grains A, B, C, and D).
FIG. 3 is a block diagram of another embodiment of the inventive system.
FIG. 4 is a block diagram of a computer system, including a computer readable storage medium 504 which stores computer code for programming the system to perform an embodiment of the inventive method.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Many embodiments of the present invention are technologically possible. It will be apparent to those of ordinary skill in the art from the present disclosure how to implement them. Embodiments of the inventive system, method, and medium will be described with reference to FIGS. 1, 2, 3, and 4. FIG. 1 is a block diagram of an embodiment of the inventive system which includes audio mixing server 1, memory 3 coupled to server 1, game server 5 coupled to server 1, and a set of N client devices C1, C2, ..., CN (where N is an integer). Each of devices C1, C2, ..., CN is coupled to audio server 1 via a network (e.g., a local network or the internet). More specifically, input port I1 and output port O1 of server 1 are coupled to device C1, and input port Ii and output port Oi of server 1 are coupled to device Ci, for each index "i" in the range 1 < i ≤ N.
Servers 1 and 5 are configured to execute a game (and/or to implement a chat room) in response to input data (including speech data) and commands from one or more of client devices C1, C2, ..., CN (e.g., in response to input data and commands from one of the client devices in a single-player mode of a game, or from all of the client devices in an N-player mode of a game), including by asserting audio, text, and video data via the network(s) to the client devices. Audio server 1 can be implemented in a simple, inexpensive manner, and includes switching subsystem 7 configured to forward speech data (e.g., compressed packets of speech data) from each of the client devices to each of all or some of the client devices, without decoding or otherwise processing the speech data. Audio server 1 is configured to mix uncompressed voice streams received at all or some of input ports I1-IN from two or more of client devices C1-CN (in some implementations, including by decompressing the voice streams prior to mixing if they arrive in compressed form) and to assert the mixed voice stream to all or some of output ports O1-ON (for transmission to all or some of client devices C1-CN).
Audio server 1 also includes scrambled speech synthesis subsystem 9, which is coupled to subsystem 7 and to memory 3 as shown, and configured to perform coded-domain scrambled speech synthesis in accordance with an embodiment of the present invention. In preferred implementations, subsystem 7 of server 1 is configured to forward unscrambled (but compressed) speech streams from one or more of client devices C1-CN with scrambled (but compressed) speech generated by subsystem 9 using coded-domain synthesis. In some implementations, server 1 (e.g., subsystem 7 thereof) is configured to decode (decompress) scrambled speech generated by subsystem 9 and unscrambled speech streams received at one or more of input ports I1-IN from one or more of client devices C1-CN, and to mix the resulting decompressed audio and assert the mixed decompressed audio to one or more of output ports O1-ON (for transmission to one or more of client devices C1-CN).
Many conventional game systems have the same architecture as the FIG. 1 system, but do not include scrambled speech synthesis subsystem 9 or any other means for scrambling speech in accordance with the present invention. Such conventional game systems include client devices (installed at users' locations), an audio mixing server in communication with the client devices over a network (e.g., the internet), and typically also a game server in communication with the audio mixing server. The game server and audio mixing server could both be implemented (as could game server 5 and audio server 1 of FIG. 1) in software that runs on a single computer, or the game server could be implemented (as could game server 5 of FIG. 1) as an appropriately programmed computer and the audio mixing server (as could audio server 1 of FIG. 1) as another appropriately programmed computer. The latter two computers, which could be located in the same facility or in different locations, would typically communicate with each other over a network (e.g., the internet). In operation of some conventional game systems, speech data (e.g., compressed speech data) indicative of a user's utterance is sent from a client device to the audio mixing server. In response, the audio mixing server can process the speech data to change the timbre or pitch of the utterance and assert speech data (e.g., compressed speech data) indicative of the altered speech to each client device in communication with the audio mixing server.
In a class of embodiments of the inventive system, a system having the conventional architecture described in the previous paragraph is modified in accordance with the invention to configure the audio mixing server to perform speech synthesis in accordance with any embodiment of the inventive method in response to speech from one or more of the client devices (to scramble the speech) while allowing pair-wise control of the speech synthesis (e.g., selective forwarding to each client device of either the original, unscrambled speech or the scrambled speech). Reliance on an appropriately configured audio mixing server to perform the speech scrambling constrains the processing capabilities typically available (in the audio mixing server) to implement the speech synthesis, since the audio mixing server typically must (as a practical matter) be a fairly simple device having limited capabilities but including the ability to forward packets of compressed audio data (sometimes referred to as "coded voice packets") between client devices without modifying the packets (e.g., in the frequent case that only a few active talkers are present in a scene or chat room). Thus, if such an audio mixing server is configured to perform speech scrambling in accordance with the invention, the speech synthesis algorithm preferably generates packets of compressed audio data (indicative of synthesized utterances, e.g., in an "alien" language) efficiently (e.g., by concatenating packets of compressed audio data, where each packet is indicative of a compressed audio data grain and each packet is retrieved from a stored dictionary) for routing to the client devices in place of input packets of compressed audio data.
With reference again to FIG. 1, subsystem 7 of audio server 1 is configured to perform selective switching of each speech stream received at one of input ports I1-IN from a client device, and each scrambled speech stream generated in subsystem 9, to selected ones of the output ports O1-ON. For example, in response to receiving a speech stream (S1) at input port I1 and a speech stream (S3) at input port I2, subsystem 9 may generate a scrambled version (S2) of the S1 stream and a scrambled version (S4) of the S3 stream, and subsystem 7 may selectively pass either the S1 and S3 streams, or the S2 and S3 streams, or the S1 and S4 streams, or the S2 and S4 streams to each of output ports O1-ON. In some embodiments, subsystem 7 is a switching matrix controlled in response to game control data asserted thereto from client devices C1-CN (or from game server 5 in response to control data or commands from devices C1-CN). For example, a first user (of device C1) may send to server 1 a control code (e.g., a code indicative of money "earned" in a game) which causes server 1 to assert to port O1 (for transmission to device C1) an unscrambled version of a speech stream from a second user (of device C2), whereas server 1 would assert to port O1 (for transmission to device C1) only a scrambled version of the speech stream from the second user if the first user had not caused device C1 to send the control code to server 1.
Subsystem 9 of server 1 is configured to synthesize scrambled speech including by: (a) analyzing speech data (received from any of devices C1-CN) indicative of speech in a first language to generate first data indicative of a sequence of snippets of the speech; and (b) generating scrambled speech data (sometimes referred to herein as "translated" speech data) indicative of speech in a second language using concatenative synthesis in response to the first data. In preferred embodiments, subsystem 9 is configured to perform steps (a) and (b) in real time in response to real-time input speech (e.g., speech data indicative of speech in the first language asserted to a microphone local to any of devices C1-CN). The second language is typically a synthetic language, unintelligible to the intended listener (e.g., an "alien" language not spoken in the real world by humans), and is preferably a natural sounding, non-repetitive, unintelligible synthetic language. In operation, subsystem 9 may scramble speech from selected participants in an online chat room or game (speech uttered by the participants that is indicated by speech data from one or more of devices C1-CN passed by subsystem 7 to subsystem 9) so that the recipients of the speech data asserted by server 1 can either understand each other (in which case they hear the originally uttered speech) or not (in which case they hear scrambled speech generated by subsystem 9 in accordance with the synthesizing step).
Subsystem 9 is configured to dynamically concatenate grains (indicative of audio snippets) output from dictionary 3A (which is stored in memory 3) in response to the first data. Typically, the first data are indicative of input speech from a chat room or game participant, the first data are asserted (with or without metadata) to subsystem 9, and in response to the first data (and optionally also metadata), subsystem 9 retrieves the grains from dictionary 3A and concatenates the retrieved grains to generate the scrambled speech data.
Dictionary 3A is an audio data file (typically a large audio data file) containing audio data indicative of exemplary output utterances in the second language. The exemplary output utterances may be words or sentences (or longer utterances) in the second language. To generate dictionary 3A, each output utterance is segmented into individually retrievable grains.
FIG. 2 is a waveform of an exemplary output utterance in the "second language." The utterance consists of concatenated grains, and FIG. 2 shows the boundaries between these grains. For example, boundary R separates grain A from the adjacent grain B, boundary S separates grain B from the adjacent grain C, and boundary T separates grain C from the adjacent grain D. Data indicative of grains A, B, C, D, and the other grains of the FIG. 2 waveform are stored in some implementations of dictionary 3A. The segmentation (i.e., location of grain-separating boundaries) indicated in FIG. 2 was determined manually, but it would be possible to automate the process or provide guidelines for doing so (e.g., including by tracking the spectral flux of the signal).
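One possible way to automate the segmentation by tracking spectral flux is sketched below. The disclosure does not specify a particular flux measure, frame length, or threshold; the half-wave rectified flux, the non-overlapping frames, the naive DFT, and all function names here are assumptions chosen to keep the sketch short and self-contained.

```python
import cmath
import math
from typing import List

def magnitude_spectrum(frame: List[float]) -> List[float]:
    """Magnitude of a naive DFT of one frame (bins 0 .. n/2)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def spectral_flux(signal: List[float], frame_len: int) -> List[float]:
    """Sum of squared positive magnitude changes between consecutive frames."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    spectra = [magnitude_spectrum(f) for f in frames]
    flux = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(spectra, spectra[1:]):
        flux.append(sum(max(c - p, 0.0) ** 2 for p, c in zip(prev, cur)))
    return flux

def candidate_boundaries(signal: List[float], frame_len: int,
                         threshold: float) -> List[int]:
    """Sample indices of frames whose flux exceeds the (assumed) threshold."""
    flux = spectral_flux(signal, frame_len)
    return [i * frame_len for i, f in enumerate(flux) if f > threshold]

# Toy signal: a low tone followed by a higher tone, switching at sample 64.
SIGNAL = ([math.sin(2 * math.pi * 2 * t / 16) for t in range(64)] +
          [math.sin(2 * math.pi * 5 * t / 16) for t in range(64)])

boundaries = candidate_boundaries(SIGNAL, frame_len=16, threshold=1.0)
```

In practice, candidate boundaries produced in this way would likely still be reviewed or refined during authoring of the dictionary.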
In accordance with typical embodiments of the invention, a sequence of grains (prestored in dictionary 3A) is retrievable from dictionary 3A in response to a corresponding input utterance (or sequence of input utterances), or in response to an input utterance (or sequence of input utterances) and "input metadata" generated from or otherwise corresponding to the input utterance(s). The input utterance (or input utterance sequence) is typically an utterance by a user of one of client devices C1-CN in the user's native language. Dictionary 3A is usually authored (and the segmentation of "second language" utterances into grains is usually performed) in a preliminary (pre-synthesis) step, for example, using traditional digital audio editing tools or other specific tools. Optionally, "output metadata" for each of all or some of the grains are also included in dictionary 3A and stored in memory 3 such that a quantity (a set) of output metadata is retrievable from dictionary 3A with the corresponding grain. The output metadata are typically authored with the rest of dictionary 3A (e.g., in response to user-defined criteria). Sets of output metadata (which may comprise one or more phonemes/visemes, and/or one or more speech state, pitch, and/or intensity values), together with each grain corresponding to each of the sets, that are output from dictionary 3A in response to an input utterance determine the scrambled speech generated in response to the input utterance.
Typically, subsystem 9 performs real-time selection of each grain to be retrieved from dictionary 3A in response both to input speech and input metadata generated from (or otherwise corresponding to) the input speech, rather than in response to the input speech alone. The input metadata (e.g., pitch values) can be extracted in real-time from the input speech (e.g., in an analyzer in any of client devices C1-CN), typically using user-defined rules. For example, the input metadata may indicate the timbre of the input speech (or a user-specified timbre for the output scrambled speech), and the scrambled speech synthesized in response to the input speech and input metadata may have its timbre determined by the input metadata (e.g., the scrambled speech's timbre may be matched to or otherwise determined by that of the input speech).
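Metadata-driven grain selection may be illustrated with a simple nearest-match rule. The use of a single pitch value as both the input metadata and the output metadata, the nearest-pitch criterion, and the names `Grain` and `select_grain` are assumptions made for this sketch; other metadata (e.g., timbre or intensity) and other user-defined rules could be used instead.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Grain:
    grain_id: int
    pitch_hz: float  # hypothetical output metadata authored with the dictionary

def select_grain(input_pitch_hz: float, candidates: List[Grain]) -> Grain:
    """Pick the grain whose stored pitch is closest to the input pitch."""
    return min(candidates, key=lambda g: abs(g.pitch_hz - input_pitch_hz))

# Toy dictionary entries with authored pitch metadata.
DICTIONARY = [Grain(0, 110.0), Grain(1, 220.0), Grain(2, 330.0)]

chosen = select_grain(200.0, DICTIONARY)
```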
As noted, audio server 1 is configured to perform coded-domain scrambled speech synthesis. During authoring of dictionary 3A, grains of scrambled speech are pre-encoded (e.g., compressed using a fixed frame codec) and included in dictionary 3A (and stored in memory 3) in this pre-encoded form. Then, during performance of coded-domain synthesis, server 1 reads out the grains sequentially from memory 3 to generate (in subsystem 9) scrambled speech consisting of a sequence of the pre-encoded grains. The coded-domain scrambled speech synthesis is thus a simple concatenation of the pre-encoded grains (e.g., fixed-length frames of compressed data) that are read from dictionary 3A in response to input speech. By implementing coded-domain synthesis, audio mixing server 1 can efficiently perform concatenative synthesis of scrambled speech (without any decoding or encoding other than what is inherent in reading pre-encoded grains from dictionary 3A in response to input speech and asserting a sequence of the grains so read from the dictionary to all or some of ports O1-ON), relying on client devices C1-CN to perform interactive capture of the input speech and optionally also "input metadata" extraction from captured input speech.
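By way of illustration only, the coded-domain concatenation described above can be sketched as follows. The sketch assumes a hypothetical in-memory dictionary that maps grain identifiers to pre-encoded grains, each a whole number of fixed-length frames; the frame size, the identifiers, and the function name are illustrative assumptions and are not taken from the described embodiments.

```python
FRAME_BYTES = 4  # hypothetical fixed frame size of the codec (assumption)

# Hypothetical dictionary: grain identifier -> pre-encoded grain bytes.
DICTIONARY = {
    "g0": b"\x01\x02\x03\x04" b"\x05\x06\x07\x08",  # two frames
    "g1": b"\x11\x12\x13\x14",                      # one frame
}


def concatenate_coded_grains(grain_ids, dictionary=DICTIONARY):
    """Concatenate pre-encoded grains without decoding or re-encoding them."""
    out = bytearray()
    for gid in grain_ids:
        grain = dictionary[gid]
        # Each grain must consist of whole fixed-length frames.
        if len(grain) % FRAME_BYTES != 0:
            raise ValueError(f"grain {gid!r} is not a whole number of frames")
        out += grain
    return bytes(out)
```

Because the grains are only copied, the cost of the operation in this sketch is that of the dictionary lookups and the byte copies.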
Typically, subsystem 9 is configured to perform speech scrambling in a manner that is not reversible, in the sense that its mapping of input speech snippets to output (scrambled speech) grains is not a 1-to-1 mapping. For example, subsystem 9 may be configured to employ a limited amount of random selection of output grains from dictionary 3A in response to input speech snippets (e.g., to read from dictionary 3A a specific grain in response to each of a first subset of a sequence of input snippets, and to read from dictionary 3A a randomly determined grain in response to each of a different subset of the sequence of input snippets).
When subsystem 9 implements a reversible scrambling method to generate scrambled speech by concatenating grains retrieved from dictionary 3A in response to input speech snippets, a recipient of the scrambled speech can employ an unscrambling dictionary to recover the input speech. For example, if client device C1 is to receive the scrambled speech from server 1 (which had generated the scrambled speech in a reversible manner), client device C1 can be configured to include an unscrambling subsystem employing an unscrambling dictionary that includes nonscrambled speech grains (indicative of nonscrambled speech snippets) that are readable in response to grains of the scrambled speech. The unscrambling subsystem would be configured to retrieve (from the unscrambling dictionary) nonscrambled speech grains in response to a sequence of grains of scrambled speech, and to concatenate the retrieved nonscrambled speech grains to generate intelligible speech (to recover input speech which had been scrambled in server 1 to generate the sequence of scrambled speech grains).
In some embodiments, the grains that comprise dictionary 3A are not compressed or otherwise pre-encoded. In such embodiments, subsystem 9 can be configured to perform scrambled speech synthesis using the non-encoded grains, including by performing compression on grains that are read from dictionary 3A in response to input speech, to generate compressed, scrambled speech (e.g., the compressed, scrambled speech may comprise fixed-length frames which are indicative of the compressed grains). Alternatively, subsystem 9 can be configured to perform scrambled speech synthesis including by performing other processing of non-encoded grains that are read from dictionary 3A in response to input speech.
Audio mixing server 1 is preferably configured to have all or some of the following features: the ability to perform pair-wise voice scrambling in a multi-participant situation (e.g., subsystems 7 and 9 are configured to scramble input speech data received at input port I1 from device C1, and to assert to each of output ports O1-ON either: a mix of the resulting scrambled speech data and speech data received at input ports I2-IN, or a mix of the unscrambled speech data from device C1 and speech data received at ports I2-IN); the ability to perform concatenative synthesis in response to input speech (or input speech data with input metadata) received from each of one or more of client devices C1-CN; the ability to control the timbre of output (scrambled) speech as a function of the timbre of the voice producing the input speech (e.g., subsystem 9 is configured to scramble input speech data received, with input metadata indicative of timbre, at input port I1 from device C1, to generate scrambled speech data indicative of scrambled speech having timbre matching the timbre of the input speech as indicated by the input metadata); and the ability to perform synthesis and mixing of scrambled and unscrambled voice streams in a distributed chat server (e.g., subsystems 7 and 9 are configured to scramble speech data received at input port I1 which are indicative of speech uttered by a participant in a chat room, and to assert to each of output ports O1-ON a mix of the resulting scrambled speech data and speech data received at input ports I2-IN which are indicative of speech uttered by other participants in the chat room).
In operation of the FIG. 1 system to implement a game or chat room, audio mixing server 1 may perform pairwise selective assertion (at output ports O1-ON) of either scrambled speech or the unscrambled input speech from which it is synthesized, to client devices C1-CN for reproduction as sound that is audible to the game or chat room participants. During some games, subsystem 7's assertion of unscrambled (rather than scrambled) speech to each participant can be enabled or disabled by control data (asserted by devices C1-CN to subsystem 7) that are indicative of specific virtual items (e.g., items which can be monetized in the game).
In operation of the FIG. 1 system to implement a game or chat room, audio mixing server 1 may perform guild-based or team-based speech scrambling. For example, subsystem 7 may assert (to all devices C1-CN whose users do not belong to a team) scrambled speech that subsystem 9 has synthesized in response to input speech uttered by a member of the team, but subsystem 7 asserts (to all devices C1-CN whose users do belong to the team) the unscrambled input speech rather than the scrambled speech.
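By way of illustration only, the team-based selection just described can be sketched as a per-listener choice between the two streams. The function name, the team labels, and the use of None to denote a user who belongs to no team are assumptions made for this sketch.

```python
def select_stream_for_listener(speaker_team, listener_team, unscrambled, scrambled):
    """Return the stream to assert to one listener for one speaker.

    Listeners on the speaker's team receive the unscrambled input speech;
    all other listeners (including those on no team) receive the scrambled speech.
    """
    same_team = speaker_team is not None and speaker_team == listener_team
    return unscrambled if same_team else scrambled
```

For example, with a speaker on team "red", a listener on team "red" would receive the unscrambled stream, whereas a listener on team "blue" or on no team would receive the scrambled stream.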
In operation of the FIG. 1 system to implement a game or chat room, audio mixing server 1 may implement speech scrambling with a voice font to disguise a participant's voice. For example, subsystem 7 may assert (to all or some of ports O1-ON) scrambled speech that subsystem 9 has synthesized in response to input speech uttered by a user of device C1, where the scrambled speech is a disguised version of the user's voice that has been disguised in accordance with a voice font selected by the user.
In operation of the FIG. 1 system to implement a game, audio mixing server 1 may perform speech scrambling in a manner that ensures auditory consistency between single player and multiplayer game modes, by re-using audio assets from AI characters that appear in the game's single player mode to scramble utterances of players competing in the game's multiplayer mode (e.g., dictionary 3A includes grains that have been generated by segmenting utterances of an AI character in the single player mode, and subsystem 9 accesses the grains from dictionary 3A during the multiplayer mode to synthesize scrambled speech in response to players' utterances).
Alternatively, subsystem 9 of audio mixing server 1 may perform concatenative synthesis to generate synthesized sound (sound data) in response to control signals or commands asserted thereto from any of devices C1-CN (rather than in response to input speech data asserted thereto from devices C1-CN). For example, the synthesized sound data can be indicative of a non-repetitive ambient soundscape/sound texture, and subsystem 7 can assert the synthesized sound data to devices C1-CN for reproduction during a game. In the latter example, server 1 may dynamically stream synthesized data indicative of a synthesized audio scene to client devices C1-CN, where each of the client devices is implemented as a mobile client device having limited memory resources.
While it is contemplated that typical embodiments of server 1 are configured to synthesize unintelligible speech in response to input speech, such embodiments can easily be modified (or variations on such embodiments can easily be produced) to synthesize sound indicative of intelligible sentences in response to specific control data sequences (e.g., in response to a control data sequence asserted from one of client devices C1-CN to subsystem 9 of server 1). Each such control data sequence can indicate a desired list of grains to be retrieved by subsystem 9 from dictionary 3A to determine the synthesized sound data, or can cause subsystem 9 to perform concatenative synthesis in a manner driven by game server 5 (where server 5 is aware of the global state of a game) to generate the synthesized sound data. For example, the synthesized sound data can be indicative of a "live" synthesized speech input (e.g., a live sports commentary for a sports game) that is streamed from server 1 to client device C1, in the case that device C1 has limited memory or physical-media streaming capabilities (e.g., where device C1 is a mobile console or phone).
In operation, some embodiments of the FIG. 1 system implement a voice chat system (e.g., an in-game voice chat system) in which server 1 performs real-time, speech-driven, language synthesis. For example, the system may automatically replace input speech uttered by a user of client device C1 with a matching alien language (scrambled) output. It is desirable that this language scrambling feature can be controlled on a pair-wise basis by pairs of users of devices C1-CN, who may belong to a common team or guild (e.g., so that members of the same team or guild can understand each other but other users cannot understand these team or guild members). This approach can also help to achieve a consistent experience in both single-player and multi-player modes for games supporting both modes. Also, speech synthesis implemented by server 1 can be an element of additional voice-related items (e.g., translators) or game-play elements (development of language skills) that could be monetized in games.
In typical embodiments, the inventive system is configured to perform interactive language synthesis, including client-side metadata extraction (e.g., performed on input speech data by client device C1, C2, ..., or CN of FIG. 1) and server-side synthesis (e.g., performed by audio server 1 of FIG. 1 on input speech and optionally also metadata from any of client devices C1, C2, ..., and CN) which uses both the extracted metadata and a definition of a synthetic language (e.g., as determined by dictionary 3A, and/or subsystem 9 and dictionary 3A, of FIG. 1).
In these embodiments, the client device (client) is responsible for analyzing the input voice stream and providing the server with suitable metadata to drive the language synthesis. The metadata can include: speech type (voiced, unvoiced, transient, silence), average pitch, normalized instantaneous pitch and level. These metadata are typically transmitted to the server as part of the input audio voice stream.
At launch time, the server typically loads all the synthetic language definition files and grain dictionary (or dictionaries) specified for a game. The server may then use the metadata and rules specified with such a dictionary, as well as the metadata extracted from the input speech to retrieve a selected sequence of suitable synthesized language frames from the dictionary and replace a sequence of input speech frames by a (concatenated) sequence of the synthesized language frames. When the grains stored in the dictionary are pre-encoded (including by being compressed), the synthesized language frames can be directly forwarded to the client, thus replacing the original speech frames. In the case where the server must mix several speech streams, the synthesized language frames are decoded prior to mixing with other input speech frames.
In a first off-line stage, the game sound designer can author a variety of alien languages that the interactive synthesis system will later use. Each language typically comprises a dictionary and a language definition file. The dictionary can be a PCM audio file containing a number of words and sentences (or other constructs in the alien language) that define the alien language. The language definition file segments the dictionary into atomic elements ("grains") that will be triggered (selected and retrieved) by the interactive synthesis system. Each grain can contain additional metadata such as pitch, intensity level (e.g., normal, whisper, and shouting), speech-type (voiced, unvoiced, transient) or a description flag indicating whether it should begin or end a sentence. Most such metadata can be automatically extracted by pre-processing the alien language waveform(s) that are stored (in segmented form) in the dictionary, thus requiring limited manual effort.
Additional rules can also be provided as part of the language definition file to control the probability of appearance of each grain in the synthesized speech.
We next describe an exemplary concatenative speech synthesis algorithm employed in some embodiments of the invention, and some specific controls that can be used during the synthesis.
The exemplary synthesis algorithm preserves to the extent practical the characteristics of the input language (e.g., the language in which the user uttered the input speech to be scrambled). As a result, the synthesized language frames (which determine the grains retrieved from the dictionary) do not follow the structure of the input speech at a fine-grain level. Instead, grains from the dictionary are selected "asynchronously" from the current speech state. However, at the time when a new grain must be selected by the algorithm, the selection process makes use of the current input speech state. The exemplary algorithm is as follows:
read the input voice state (including by determining the current input voice pitch as indicated by metadata);
if the current input voice state is silence, select a null grain;
if the current input voice state is not silence, select a grain based on the current input voice state and any current grain type enforcing voice-activity constraints (if the current input voice state is a transition between non-silence and silence, select a predetermined "transition" grain or grain sequence having the proper target pitch and indicative of an utterance beginning transition in the synthesized language; if the current input voice state is the end of a transition, select a predetermined "end of transition" grain or grain sequence (e.g., a sequence ending with a null grain) having the proper target pitch and indicative of an utterance ending transition in the synthesized language; if the current input voice state is not a transition, select a matching grain from the dictionary having the proper target pitch); and
retrieve the current frame of the currently selected grain and increment the frame count (repeat this operation until all frames of the currently selected grain are retrieved, and then return to the start of the algorithm).
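By way of illustration only, the steps above can be sketched as a loop that emits one output frame per input state and selects a new grain only when the previous grain is exhausted. The state labels ("silence", "onset", "offset", "speech"), the Grain fields, the string stand-ins for audio frames, and the nearest-pitch matching rule are assumptions made for this sketch; they are not details of the exemplary algorithm.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Grain:
    kind: str          # "start", "end", "continuous", or "null"
    pitch: int         # target pitch in hypothetical quantized units
    frames: List[str]  # stand-ins for audio frames


NULL_GRAIN = Grain("null", 0, ["<silence>"])


def select_grain(state: str, pitch: int, dictionary: List[Grain]) -> Grain:
    """Select a grain from the current input voice state and pitch."""
    if state == "silence":
        return NULL_GRAIN
    # Transitions map to sentence-start / sentence-end grains (assumed labels).
    kind = {"onset": "start", "offset": "end"}.get(state, "continuous")
    candidates = [g for g in dictionary if g.kind == kind]
    if not candidates:
        return NULL_GRAIN
    # Assumed matching rule: nearest target pitch.
    return min(candidates, key=lambda g: abs(g.pitch - pitch))


def synthesize(states: Iterable[Tuple[str, int]], dictionary: List[Grain]) -> List[str]:
    """Emit one frame per input state; reselect only when a grain is exhausted."""
    out: List[str] = []
    current = None  # currently selected grain
    index = 0       # frame count within the current grain
    for state, pitch in states:
        if current is None or index >= len(current.frames):
            current = select_grain(state, pitch, dictionary)
            index = 0
        out.append(current.frames[index])
        index += 1
    return out
```

In this sketch the input state is consulted only at grain boundaries, which reflects the "asynchronous" selection described above: a multi-frame grain continues to play even if the input state changes while it is being retrieved.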
Typically, the selection process includes searching the dictionary for grains that match the desired type and pitch. Typically, a "potential grain" set comprising more than one grain satisfies each set of constraints, in which case the grain actually selected is the grain (in the potential grain set) that was selected least recently (e.g., as determined using a global counter variable for each grain). To add randomness, the global counter can be altered in several ways, e.g., when changes occur in the voice state of the input speech. This will generate a pseudo-random sequence that is linked to input voice activity and will avoid a more costly random number generation.
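By way of illustration only, selection of the least recently selected grain from a potential grain set can be sketched with a counter that records when each grain was last selected. The class and attribute names are assumptions made for this sketch, and the alteration of the counter in response to changes in the input voice state is deliberately not modeled, since the manner of alteration is not specified above.

```python
class GrainSelector:
    """Pick, from a candidate set, the grain that was selected least recently."""

    def __init__(self, grain_ids):
        self.counter = 0                            # global selection counter
        self.last_used = {g: 0 for g in grain_ids}  # counter value at last selection

    def select(self, candidates):
        # Ties (e.g., grains never yet selected) go to the first candidate listed.
        chosen = min(candidates, key=lambda g: self.last_used[g])
        self.counter += 1
        self.last_used[chosen] = self.counter
        return chosen
```

No random number generator is called in this sketch; the variation in the output sequence comes only from the selection history and from which candidate sets are presented.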
To obtain a more compelling synthesis, it is possible to link certain grains to the input voice activity state. For instance, when the input speech data indicates a transition from silence to speech, grains corresponding to beginnings of sentences should preferably be selected. Similarly, when the input speech data indicates a transition from speech to silence, grains corresponding to sentence terminations should be triggered. Metadata describing whether each grain in the dictionary corresponds to a sentence start, a sentence end, or continuous speech can be either automatically generated or provided by a user during authoring of the dictionary. Typically, some direct transitions between grains (for example, from a "continuous speech" grain to a "sentence start" grain) are not allowed. Instead, a "sentence end" grain will be retrieved at the end of a sequence of "continuous speech" grains to better preserve the prosody of the input speech language. As a result, a local discrepancy may appear between actual voice activity (e.g., as indicated by an activity flag transmitted with the input speech data from a client device) and the voice activity of the synthesized speech generated in response to the input speech data.
To avoid generating a discontinuous sequence, the synthesis preferably tries as much as possible to concatenate the retrieved grains in the order they are retrieved from the dictionary. However, to also enable some variation to occur, the order may be modified as a function of the input voice characteristics, for instance when the state of the input speech changes (e.g., switches from "voiced" to "unvoiced").
To guarantee the continuity of the synthesized speech and avoid noticeable popping artifacts, it is desirable to include blending between two successively triggered (retrieved) grains. Although the encoding mechanism may already include a blending between reconstructed frames (e.g., overlap add reconstruction), this blending between reconstructed frames might not be sufficient to generate a signal perceived as continuous. A solution to this problem in (audio) texture synthesis is to analyze the input grains and find best matching transitions. Grains can then be chosen according to a transition probability which avoids triggering of discontinuous grains. However, when starting with a limited set of grains, this may quickly lead to repetition. An alternative solution is to pre-synthesize transition grains for each possible pair of original grains that can be concatenated together, and to include the transition grains with the other grains in a dictionary to be employed in accordance with the present invention for concatenative scrambling. This approach is similar in spirit to use of Wang tiles in computer graphics, where tiles with predefined continuity constraints are created during authoring and can then be directly concatenated. Automatically generating transition grains during authoring of the dictionary reduces authoring time and allows for maximizing the randomness of the grain selection process during scrambled speech synthesis, leading to less repetitive results while ensuring good quality. A simple way to generate transition grains is to cross-fade the beginning and end of the two original grains over a pre-defined number of frames. Other interpolation techniques (e.g., based on harmonic tracking and interpolation) could also be used.
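By way of illustration only, generation of a transition grain by cross-fading can be sketched as follows. The sketch assumes that each grain is a list of decoded values (one per sample or per frame) and uses a linear fade; the function name and the choice of a linear weighting are assumptions, and other fade curves could equally be used.

```python
def crossfade_transition(grain_a, grain_b, n):
    """Build a transition grain by fading the end of grain_a into the start of grain_b.

    The last n values of grain_a are blended with the first n values of
    grain_b using linearly increasing weights for grain_b.
    """
    if n <= 0 or n > len(grain_a) or n > len(grain_b):
        raise ValueError("n must be between 1 and the length of the shorter grain")
    tail = grain_a[-n:]
    head = grain_b[:n]
    out = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of grain_b, rising toward 1
        out.append((1 - w) * tail[i] + w * head[i])
    return out
```

A transition grain produced in this way during authoring could be stored in the dictionary alongside the original grains, so that no blending is needed at synthesis time.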
One important aspect of language synthesis is being able to match the input speech voice characteristics, e.g., pitch and timbre. These characteristics can be directly measured on the input voice stream or derived from a target voice font specified by the user (if the client device allows voice modification of the input speech). For example, the synthesized language track can be generated to match the gender of the speaker. This can be achieved by measuring and sending to the server (as metadata with the input speech) the average pitch Pi of the input speech. Assuming the average pitch Pd of the dictionary is known, it is possible to generate target synthesized frames with an output pitch Pi by pitch-shifting the original dictionary frames by a factor Pi/Pd.
If processing of compressed audio frames stored in the dictionary is not possible, modified copies of a "raw" dictionary that are pitch-shifted by different values of Pi/Pd could be included in a final version of the dictionary. Thus, the final version of the dictionary would be an audio data file indicative of at least one exemplary utterance (in the synthetic, "alien" language) and at least one pitch-shifted version of the exemplary utterance; some of the grains retrievable from the dictionary would be indicative of segments of the exemplary utterance (in the synthetic language), and other ones of the grains retrievable from the dictionary would be indicative of segments of the pitch-shifted version of the exemplary utterance.
The synthesis algorithm could then directly select frames (indicative of grains) from the "raw" dictionary or an appropriate pitch-shifted version of the "raw" dictionary. This approach can be generalized to a larger set of control parameters including pitch, timbre and intensity by including in the final version of the dictionary different sub-dictionaries for different combinations of the relevant parameters. Server-side concatenative synthesis can be implemented in an extremely fast manner without requiring any signal processing or any call to a random function. Its main (or only significant) cost is searching the dictionary for target grains which can be efficiently implemented.
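By way of illustration only, the choice between the "raw" dictionary and its pitch-shifted versions can be sketched as a selection of the pre-computed shift factor closest to the target factor Pi/Pd. The function names, the use of hertz for the pitch values, and the comparison of factors on a logarithmic (ratio) scale are assumptions made for this sketch.

```python
import math


def pitch_shift_factor(input_pitch_hz, dictionary_pitch_hz):
    """Return the target factor Pi/Pd for matching the input speech pitch."""
    if input_pitch_hz <= 0 or dictionary_pitch_hz <= 0:
        raise ValueError("pitch values must be positive")
    return input_pitch_hz / dictionary_pitch_hz


def choose_subdictionary(input_pitch_hz, dictionary_pitch_hz, available_factors):
    """Return the pre-computed shift factor closest (as a ratio) to Pi/Pd.

    A factor of 1.0 denotes the unshifted "raw" dictionary.
    """
    target = pitch_shift_factor(input_pitch_hz, dictionary_pitch_hz)
    return min(available_factors, key=lambda f: abs(math.log(f / target)))
```

For instance, with a dictionary whose average pitch is 110 Hz and sub-dictionaries shifted by factors 0.5, 1.0 and 2.0, an input voice with an average pitch of 220 Hz would be served from the sub-dictionary shifted by 2.0.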
Extracting the required metadata on the client side can also be implemented in an extremely fast manner and can be performed as part of other required processing, e.g., voice font processing. The metadata (e.g., pitch, intensity and voice-state) extracted from the input speech can be efficiently packed into 8-bit words and updated at a low rate, typically 5 to 16 Hz. This information could be shared by a viseme selection algorithm for visual speech synthesis.
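By way of illustration only, packing of the extracted metadata into an 8-bit word can be sketched as follows. The bit layout (two bits of voice state, two bits of intensity, four bits of quantized pitch index), the label orderings, and the function names are assumptions made for this sketch; the actual layout is not specified above.

```python
# Assumed label orderings; each index must fit in two bits.
VOICE_STATES = ("silence", "voiced", "unvoiced", "transient")
INTENSITIES = ("normal", "whisper", "shouting")


def pack_metadata(voice_state, intensity, pitch_index):
    """Pack voice state, intensity and a 4-bit pitch index into one byte."""
    if not 0 <= pitch_index <= 15:
        raise ValueError("pitch_index must fit in 4 bits")
    s = VOICE_STATES.index(voice_state)
    i = INTENSITIES.index(intensity)
    return (s << 6) | (i << 4) | pitch_index


def unpack_metadata(word):
    """Inverse of pack_metadata for words produced by it."""
    return (
        VOICE_STATES[(word >> 6) & 0x3],
        INTENSITIES[(word >> 4) & 0x3],
        word & 0xF,
    )
```

At an update rate of 16 Hz, one such word per update would amount to 128 bits per second of metadata.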
Next, with reference to FIG. 3, we describe another class of embodiments in which scrambled speech synthesis is performed by an audio server (server 11 of FIG. 3) and at least one client device (client devices 22 and 23 of FIG. 3) coupled thereto typically by at least one network. Server 11 is configured to assert (e.g., stream) grain identifiers indicative of grains of scrambled speech to a selected one (or both) of client devices 22 and 23, but not to assert the grains themselves to either of client devices 22 and 23. Audio server 11 includes a switching subsystem (referred to herein as a "scrambling matrix") coupled to server 11's input and output ports (which are in turn coupled to devices 22 and 23 and optionally also to other client devices not shown in FIG. 3). The scrambling matrix (together with the input and output ports) can be identical to subsystem 7 of server 1 of FIG. 1. Each of client devices 22 and 23 includes a scrambled speech synthesis subsystem (to be described) which can be similar to subsystem 9 of FIG. 1 but which differs in some respects (to be described) from subsystem 9 of FIG. 1. Game server 15 of FIG. 3 (which can be identical to game server 5 of FIG. 1) is coupled to server 11.
Memory 12, coupled locally to client device 22, stores a first dictionary of grains for use in concatenative synthesis of scrambled speech (e.g., scrambled speech in an unintelligible alien language). Memory 13, coupled locally to client device 23, stores a second dictionary of grains (which can be identical to the first dictionary) for use in concatenative synthesis of scrambled speech. No such dictionary of grains is stored in (or local to) server 11.
Client device 22 is configured to analyze input speech data indicative of speech uttered by a user into microphone 24 (connected locally to device 22), to determine a sequence of input speech snippets (and optionally also input metadata for each of all or some of the snippets). Client device 23 is configured to analyze input speech data indicative of speech uttered by a user into microphone 25 (connected locally to device 23), to determine a sequence of input speech snippets (and optionally also input metadata for each of all or some of the snippets).
Device 22's scrambled speech synthesis subsystem selects an appropriate grain identifier indicative of a corresponding grain of scrambled speech stored in the dictionary in memory 12 (e.g., a stored grain of speech in an alien language) for each of a sequence of the analyzed snippets of input speech, and device 22 asserts both the input speech stream and the selected grain identifiers (and typically also input metadata associated with the grain identifiers) to audio server 11. The grain identifiers and associated metadata typically function as a very low bitrate encoding of scrambled speech to be generated in response to the input speech.
Similarly, device 23's scrambled speech synthesis subsystem selects an appropriate grain identifier indicative of a corresponding grain of scrambled speech stored in the dictionary in memory 13 (e.g., a stored grain of speech in an alien language) for each of a sequence of the analyzed snippets of input speech, and device 23 asserts both the input speech stream and the selected grain identifiers (and typically also input metadata associated with the grain identifiers) to audio server 11. Typically also, the grain identifiers and associated metadata function as a very low bitrate encoding of scrambled speech to be generated in response to the input speech.
Audio server 11's above-mentioned scrambling matrix includes server 11's input and output ports, and is configured to route both the input speech and the grain identifiers (and any input metadata) from each of client devices 22 and 23 from the relevant one of server 11's input ports to all (or an appropriate subset) of server 11's output ports for transmission to the relevant one(s) of client devices 22 and 23.
Since grains that comprise scrambled speech are not themselves available to server 11, server 11 does not mix any scrambled speech stream with any other scrambled speech stream or with any unscrambled speech stream. In the FIG. 3 system, audio server 11 functions as a forwarding bridge, not a mixing bridge. If (as is typical) each of devices 22 and 23 sends grain identifiers and any associated metadata to server 11 at a low bitrate (e.g., less than 1 Kbit/sec.) compared to the bitrate of a typical voice stream, a simple (and inexpensive) implementation of server 11 can forward the received grain identifiers (and any associated metadata) from devices 22 and 23 (i.e., grain identifiers and metadata for all scrambled speech streams to be generated) independently to each destination client device. For example, in a short time window, client device 23 (or an additional client device other than devices 22 and 23) might receive from server 11 one or several packets of compressed intelligible (unscrambled) speech data from device 22 as well as one or several packets of encoded scrambled (unintelligible language) speech data from device 23, each of the packets of encoded scrambled speech data comprising a grain identifier (and optionally also associated metadata) indicative of a grain of scrambled speech.
In response to reception of a scrambled speech packet containing a grain identifier and associated metadata, a scrambled speech synthesis subsystem in each client device of the FIG. 3 system synthesizes the corresponding scrambled speech audio signal using the dictionary resident in its memory (e.g., the speech synthesis subsystem of device 22 synthesizes the corresponding scrambled speech audio signal using the dictionary resident in memory 12 and the speech synthesis subsystem of device 23 synthesizes the scrambled speech audio signal using the dictionary resident in memory 13). Speech synthesis in response to a sequence of the packets typically includes a step of reading a corresponding sequence of grains of scrambled speech from the dictionary in response to a sequence of grain identifiers and associated metadata indicated by the packet sequence (or decompressing the sequence of packets of grain identifiers and associated metadata, in cases in which the packets are received in compressed form, and then reading a corresponding sequence of grains of the scrambled speech from the dictionary in response to the decompressed grain identifiers and associated metadata). In contrast with server-side implementations of speech scrambling (e.g., that of above-described FIG. 1), it is possible (in the FIG. 3 system and other client-side implementations of speech scrambling) for a client device to perform additional processing (e.g., pitch correction) at the time of synthesizing scrambled speech in response to grain identifiers (and optionally also metadata) received from an audio server, since the grain identifying data (and any associated metadata) that determine each scrambled speech stream are never mixed with other streams received by the client device. 
In order to configure each client device to perform such additional processing, it would of course typically be necessary to implement the client device to have more processing power than if the client device were not configured to perform such additional processing. Thus, the FIG. 3 system and other client-side implementations of speech scrambling would typically require more processing and memory on the client side than would the FIG. 1 system and other server-side implementations of speech scrambling in accordance with the invention.
In some implementations of FIG. 3, each of memories 12 and 13 stores one or more speech scrambling dictionaries (each including grains of speech which are independently retrievable to perform concatenative speech synthesis, e.g., each dictionary including grains of speech in a different language), and each of client devices 22 and 23 is configured to indicate to audio server 11 which dictionary (or dictionaries) it has locally. Audio server 11 includes (or is coupled locally to) memory in which all these dictionaries are stored. Thus, each dictionary to be used for concatenative speech synthesis is stored in (or available locally to) audio server 11, and is optionally also stored in (or available locally to) one or both of client devices 22 and 23. Audio server 11 is also configured to:
send (e.g., forward from client device 22 or 23, or generate internally and then assert) grain identifiers and optionally also associated metadata, to each of client devices 22 and 23 that has a local dictionary useful for concatenative synthesis of scrambled speech in response to the grain identifiers and any associated metadata; and
send (i.e., generate internally and then assert) scrambled speech (sequences of grains read from a dictionary in or local to audio server 11) to each of client devices 22 and 23 that does not have a local dictionary useful for synthesizing the scrambled speech in response to grain identifiers (or grain identifiers and associated metadata).
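By way of illustration only, the two cases above can be sketched as a per-client decision made by the audio server. The function name, the tagged-tuple return value, and the representation of dictionaries as nested mappings keyed by language are assumptions made for this sketch.

```python
def payload_for_client(client_dictionaries, language, grain_ids, server_dictionaries):
    """Decide what the server sends to one client for one scrambled stream.

    client_dictionaries: set of languages whose dictionaries the client has
        reported holding locally.
    server_dictionaries: mapping language -> {grain identifier: grain data}.
    """
    if language in client_dictionaries:
        # Client can synthesize locally: send only the grain identifiers.
        return ("grain_ids", list(grain_ids))
    # Client lacks the dictionary: read the grains server-side and send them.
    grains = [server_dictionaries[language][g] for g in grain_ids]
    return ("grains", grains)
```

In this sketch a client that has reported the relevant dictionary receives only the low-bitrate identifier stream, whereas a client without it receives the grains themselves.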
It should be appreciated that in a class of variations on the FIG. 3 system, audio server 11 is implemented in a client device (e.g., is a "peer-hosted server"). This client device would include a scrambling matrix of the above-described type, and would be configured to assert (e.g., stream) grain identifiers indicative of grains of scrambled speech to other client devices that do not implement the functions of server 11 without asserting the grains themselves to these other client devices. In other variations on the FIG. 3 system (e.g., systems having "peer-to-peer" architecture), each client device implements the functions of audio server 11. In these variations, each client device would include a scrambling matrix of the above-described type, would include (or be coupled locally to) memory storing a dictionary of the above-described type, and would be configured to assert (e.g., stream) grain identifiers indicative of grains of scrambled speech to other ones of the client devices without asserting the grains themselves to the other client devices.
In typical embodiments, the inventive concatenative speech synthesis system (e.g., server 1 or subsystem 9 of FIG. 1 or client device 22 or 23 of FIG. 3), which may be configured to analyze input speech data (indicative of input speech in a first language) to generate data indicative of a sequence of snippets of the input speech which determine grains of speech in a second language (e.g., an implementation of subsystem 9 of FIG. 1 configured to analyze speech data from any of devices C1-CN indicative of speech in a first language to generate first data indicative of a sequence of snippets of the speech), is or includes a general or special purpose processor programmed with software (or firmware) and/or otherwise configured to perform an embodiment of the inventive method. In other embodiments, the inventive concatenative speech synthesis system is implemented by appropriately configuring (e.g., by programming) a configurable audio digital signal processor (DSP) to perform an embodiment of the inventive method. The audio DSP can be a conventional audio DSP that is configurable (e.g., programmable by appropriate software or firmware, or otherwise configurable in response to control data) to perform any of a variety of operations on input speech data.
In some embodiments, the inventive concatenative speech synthesis system is a general purpose processor, coupled to receive input data (indicative of input speech or grain identifiers, and optionally also metadata) and programmed to generate output data indicative of scrambled speech (or grain identifiers and optionally also metadata), and optionally also to route the output data to one or more selected recipients (e.g., to one or more ports each configured to be coupled to a different client device), in response to the input data by performing an embodiment of the inventive method. The processor is typically programmed with software (or firmware) and/or otherwise configured (e.g., in response to control data) to perform any of a variety of operations on the input data, including an embodiment of the inventive method. The computer system of FIG. 4 is an example of such a system. The FIG. 4 system includes general purpose processor 501 which is programmed to perform any of a variety of operations on input data, including an embodiment of the inventive method.
The computer system of FIG. 4 also includes input device 503 (e.g., a mouse and/or a keyboard) coupled to processor 501, storage medium 504 coupled to processor 501, and display device 505 coupled to processor 501. Processor 501 is programmed to implement the inventive method in response to instructions and data entered by user manipulation of input device 503. Computer readable storage medium 504 (e.g., an optical disk or other tangible object) has computer code stored thereon that is suitable for programming processor 501 to perform an embodiment of the inventive method. The computer code may include code that determines each scrambling dictionary to be employed to perform concatenative synthesis in accordance with an embodiment of the invention. In operation, processor 501 executes the computer code to process data indicative of input speech (or grain identifiers) and optionally also metadata in accordance with the invention to generate output data indicative of scrambled speech (or grain identifiers) and optionally also metadata.
Server 1 (or subsystems 7 and 9 of server 1) of above-described FIG. 1, or client device 22 or 23 of above-described FIG. 3, could be implemented as general purpose processor 501, with the "audio input" of FIG. 4 being data indicative of input speech (or grain identifiers) and optionally also metadata, and the "audio output" of FIG. 4 being output data indicative of scrambled speech (or grain identifiers) and optionally also metadata. A conventional digital-to-analog converter (DAC) could operate on the output data to generate analog versions of output speech (or other output audio) data for reproduction by physical speakers.
Aspects of the invention are a computer system programmed to perform any embodiment of the inventive method, and a computer readable medium which stores computer-readable code for implementing any embodiment of the inventive method.
While specific embodiments of the present invention and applications of the invention have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of the invention described and claimed herein. It should be understood that while certain forms of the invention have been shown and described, the invention is not to be limited to the specific embodiments described and shown or the specific methods described.
