

Title:
INTENT-BASED LANGUAGE TRANSLATION
Document Type and Number:
WIPO Patent Application WO/2021/016345
Kind Code:
A1
Abstract:
The present inventive concept contemplates a system or method of translating a user's voice and intent into a different language. The method contemplates extracting the objectives of a first voice input and translating those objectives to a different language with different vocal characteristics. Vocal characteristics comprise any facet of communicative expression associated with an objective.

Inventors:
DALCE REGINALD (US)
Application Number:
PCT/US2020/043058
Publication Date:
January 28, 2021
Filing Date:
July 22, 2020
Assignee:
DALCE REGINALD (US)
International Classes:
G06F40/58; G06F3/16; G06F9/451; G10L13/08; G10L15/00
Foreign References:
JP2007148039A2007-06-14
US20160110350A12016-04-21
KR20020004490A2002-01-16
US20120078607A12012-03-29
CN104991892A2015-10-21
Attorney, Agent or Firm:
FISH, Robert D. et al. (US)
Claims:
CLAIMS

What is claimed is:

1. A method of translating content and meaning of a voice input of a user in a first language into a second language, comprising:

receiving the voice input in the first language;

determining a voice input content;

analyzing a vocal characteristic of the voice input with respect to the first language;

extracting an objective of the voice input based on the voice input content and analysis of the vocal characteristic;

determining a voice output content in the second language;

determining a vocal characteristic output associated with the second language that, applied to the voice output content, conveys the objective in the second language; and

applying the vocal characteristic output to the voice output content to generate a voice output.

2. The method of claim 1, wherein determining the voice input content comprises identifying at least two grammatical elements and assigning a semantic relation between the two grammatical elements.

3. The method of claim 1 or 2, wherein the vocal characteristics are associated with one or more emotions or intentions.

4. The method of any of claims 1 to 3, further comprising receiving a desired voice output.

5. The method of any of claims 1 to 4, wherein extracting the objective of the voice input comprises requesting input from the user identifying, at least partially, the objective of the voice input.

6. The method of any of claims 1 to 5, further comprising translating at least one of a written input or physical input associated with the first language to at least one of a written output or physical output associated with the second language.

7. The method of any of claims 1 to 6, wherein translating the content and meaning of the voice input in the first language into the second language occurs in real-time.

8. The method of any of claims 1 to 7, further comprising conveying the voice output to a second language preference user.

9. The method of claim 8, further comprising translating content and meaning of a third language voice input into the first language, at least partially simultaneous with conveying the voice output to the second language preference user.

10. The method of claim 9, wherein the second language is different than the third language.

11. The method of any of claims 8 to 10, further comprising:

at least partially overlapping with conveying the voice output to the second language preference user, receiving a third language voice input;

determining a third language voice input content;

analyzing a third language vocal characteristic of the voice input with respect to the third language;

extracting an objective of the third language voice input based on the third language voice input content and analysis of the third language vocal characteristic;

determining a first language voice output content with respect to the third language voice input;

determining a first language vocal characteristic output associated with the third language that, applied to the first language voice output content, conveys the objective in the first language; and

applying the first language vocal characteristic output to the first language voice output content to generate a first language voice output.

12. A method of translating an objective from a first communication to a second communication, comprising:

receiving a first communication input;

determining a first root content and a first communication characteristic of the first communication input;

extracting an objective of the first communication input based on the first root content and the first communication characteristic;

translating the first root content and first communication characteristic into a second root content and second communication characteristic, respectively; and

applying the second communication characteristic to the second root content to generate a second communication output, such that the second communication output conveys the objective.

13. The method of claim 12, wherein the first communication is in a first language, and the second communication is in a second language different than the first.

14. The method of claim 12 or 13, wherein translating the first communication into a second communication occurs in real-time during a communication between a user preferring the first language and a user preferring the second language.

15. The method of any of claims 12 to 14, wherein the first communication is at least one of a voice, a text, a symbol, a pictorial, or a physical gesture.

16. The method of any of claims 12 to 15, wherein the communication is at least one of a video call or a phone call.

17. The method of any of claims 12 to 16, further comprising conveying the second communication to the user preferring the second language, and further comprising translating a third language communication input into a first language communication output at least partially simultaneously with conveying the second communication.

18. The method of any of claims 12 to 17, wherein the first root content comprises at least two grammatical elements and a semantic relationship between the two grammatical elements.

Description:
INTENT-BASED LANGUAGE TRANSLATION

[0001] This application claims priority to U.S. application with the serial number 16/519,838, filed July 23, 2019, which is incorporated herein by reference in its entirety.

Field of the Invention

[0002] The field of the invention is language translation.

Background

[0003] The following description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided in this application is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.

[0004] The prior art includes PCT Patent Application No. US 2004/013366 to Cutaia. Cutaia discloses a method of storing different speech vectors associated with different speakers and different translations. However, Cutaia fails to consider the differences between cultural expressions of emotions and their associated unique vocal characteristics.

[0005] US Patent No. 7,437,704 to Dahne-Stuber discloses a real-time software translation method that translates text to a different language to localize the software content in real-time such that post-release localization and its accompanying delays are unnecessary. However, Dahne-Stuber fails to contemplate the complexities of translating spoken language in real-time while translating the intent behind a statement in one language to another, which can require various vocal characteristics to be translated differently in different languages rather than simple mirroring. For example, anger in American-English can be expressed with different intonation and pacing than a speaker would use in Japanese.

[0006] All publications identified herein are incorporated by reference to the same extent as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference. Where a definition or use of a term in an incorporated reference is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein applies and the definition of that term in the reference does not apply.

[0007] As used in the description herein and throughout the claims that follow, the meaning of "a," "an," and "the" includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein, the meaning of "in" includes "in" and "on" unless the context clearly dictates otherwise.

[0008] Unless the context dictates the contrary, all ranges set forth herein should be interpreted as being inclusive of their end points, and open-ended ranges should be interpreted to include only commercially practical values. Similarly, all lists of values should be considered as inclusive of intermediate values unless the context indicates the contrary.

[0009] The recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range. Unless otherwise indicated herein, each individual value within a range is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., "such as") provided with respect to certain embodiments herein is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention otherwise claimed. No language in the specification should be construed as indicating any non-claimed element essential to the practice of the invention.

[0010] Groupings of alternative elements or embodiments of the invention disclosed herein are not to be construed as limitations. Each group member can be referred to and claimed individually or in any combination with other members of the group or other elements found herein. One or more members of a group can be included in, or deleted from, a group for reasons of convenience and/or patentability. When any such inclusion or deletion occurs, the specification is herein deemed to contain the group as modified, thus fulfilling the written description of all Markush groups used in the appended claims.

[0011] Thus, there is still a need to translate vocal characteristics associated with a translation to accurately reflect the intent of the original statement.

Summary of The Invention

[0012] The inventive subject matter provides apparatus, systems and methods for translating the voice content and vocal characteristics of a voice input to a translated voice content and translated vocal characteristics.

[0013] The methods herein contemplate translating content and meaning of a voice input in a first language into a second language. The content and meaning of a voice input are translated by analyzing and determining the voice input content and associated vocal characteristics. The invention herein further contemplates extracting an objective of the voice input content and vocal characteristics within the context of the first language, for example based on the analysis of the input and vocal characteristics. Based on the extracted objectives and vocal characteristics, a second set of vocal characteristics and a voice output content associated with the second language are determined. The original voice input is then converted to the second language with corresponding vocal characteristics that convey the meaning behind the original voice input. It is important to note that the vocal characteristics of the second language to convey a particular emotion can be different from the vocal characteristics for the first language for the same emotion.

[0014] Further systems, methods, and devices for translating an objective from a first communication to a second communication are contemplated. A first communication input is received, and a first root content and a first communication characteristic of the first communication input are determined. An objective of the first communication input is extracted based on the first root content and the first communication characteristic. The first root content and first communication characteristic are translated into a second root content and second communication characteristic, respectively. The second communication characteristic is applied to the second root content to generate a second communication output, such that the second communication output conveys the objective (e.g., with respect to a second language, etc.). The second communication is then conveyed to the user preferring the second language, providing not only translation but the objective of the original communication as well.

[0015] Various objects, features, aspects and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments, along with the accompanying drawing figures in which like numerals represent like components.

Brief Description of The Drawings

[0016] Fig. 1 is a functional block diagram illustrating a distributed data processing environment.

[0017] Fig. 2 is a schematic of a method of extracting objectives from a voice input.

[0018] Fig. 3 depicts a method for translating a voice input and extracted objectives to a different language with different vocal characteristics.

[0019] Fig. 4 depicts a block diagram of components of the server computer executing translation engine 110 within the distributed data processing environment of Fig. 1.

[0020] Fig. 5 depicts an illustration of multi-user communication of the inventive subject matter.

Detailed Description

[0021] The inventive subject matter contemplates methods, systems, and devices for translating content and meaning of a voice input of a user in a first language into a second language. The voice input in the first language is received and a voice input content is determined. A vocal characteristic of the voice input with respect to the first language is also analyzed. An objective of the voice input is determined based on the voice input content and analysis of the vocal characteristic. A voice output content in the second language is then determined. A vocal characteristic output associated with the second language is generated, such that when applied to the voice output content, the product conveys the objective in the second language. The vocal characteristic output is then applied to the voice output content to generate a voice output, which is generally conveyed to a second language preference user.
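As an illustrative (non-limiting) sketch of the flow just described, the Python outline below mirrors the named steps: receive the input, extract an objective from content plus vocal characteristics, then determine and apply output characteristics in the second language. The class and function names (VoiceInput, extract_objective, translate, etc.) are hypothetical placeholders, and the translation and lookup logic is stubbed; it is not an implementation of translation engine 110.

    from dataclasses import dataclass

    @dataclass
    class VoiceInput:
        language: str                 # first language, e.g. "en"
        content: str                  # recognized words of the voice input
        vocal_characteristics: dict   # e.g. {"intonation": "rising", "pace": "fast"}

    @dataclass
    class VoiceOutput:
        language: str
        content: str
        vocal_characteristics: dict

    def extract_objective(voice_in: VoiceInput) -> str:
        """Combine content and vocal characteristics into an objective label (placeholder logic)."""
        if voice_in.vocal_characteristics.get("intonation") == "rising":
            return "question_or_urgency"
        return "neutral_statement"

    def translate(voice_in: VoiceInput, target_language: str) -> VoiceOutput:
        objective = extract_objective(voice_in)
        # 1. determine voice output content in the second language (stubbed here)
        output_content = f"[{target_language} translation of: {voice_in.content}]"
        # 2. determine a vocal characteristic output that conveys the same objective
        #    in the second language (lookup stubbed here)
        output_characteristics = {"objective": objective, "language": target_language}
        # 3. apply the vocal characteristic output to the output content
        return VoiceOutput(target_language, output_content, output_characteristics)

    if __name__ == "__main__":
        vi = VoiceInput("en", "Why is my order delayed?", {"intonation": "rising", "pace": "fast"})
        print(translate(vi, "ja"))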

[0022] In some embodiments, determining the voice input content includes identifying at least two grammatical elements and assigning a semantic relation between the two grammatical elements. The vocal characteristics are typically associated with one or more emotions or intentions or innuendos. The inventive systems and methods can also be guided by receiving a desired voice output, for example from the first language or originating language user. Likewise, extracting the objective of the voice input is aided in some embodiments by requesting input from the first language or originating user identifying, at least partially, the objective of the voice input.
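A minimal sketch of how "at least two grammatical elements and a semantic relation between them" might be represented follows; the data structures and role labels are assumptions for illustration only.

    from dataclasses import dataclass

    @dataclass
    class GrammaticalElement:
        text: str
        role: str  # e.g. "subject", "verb", "object"

    @dataclass
    class SemanticRelation:
        head: GrammaticalElement
        dependent: GrammaticalElement
        relation: str  # e.g. "agent-of", "patient-of"

    # "The shipment arrived late" -> two grammatical elements and one relation between them
    subject = GrammaticalElement("the shipment", "subject")
    verb = GrammaticalElement("arrived late", "verb")
    content = SemanticRelation(head=verb, dependent=subject, relation="agent-of")
    print(content)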

[0023] Further translations, either simultaneously or sequentially with the above, include translating at least one written input or physical input associated with the first language to at least one written output or physical output associated with the second language. Preferably, translating the content and meaning of the voice input in the first language into the second language occurs in real-time. Similarly, translating content and meaning of a third language voice input into the first language can occur at least partially simultaneous with conveying the voice output to the second language preference user, for example during overlapping discussion between multiple users with multiple language preferences. Typically the second language is different than the third language, though some users may share one or more common preferred languages.

[0024] Viewed from another perspective, a third language voice input is received at least partially overlapping with one or more steps above. A third language voice input content is determined and a third language vocal characteristic of the voice input is analyzed with respect to the third language. An objective of the third language voice input is extracted based on the third language voice input content and analysis of the third language vocal characteristic. A first language voice output content is then determined with respect to the third language voice input.

A first language vocal characteristic output associated with the third language is likewise determined such that, applied to the first language voice output content, the product conveys the objective in the first language. The first language vocal characteristic output is then applied to the first language voice output content to generate a first language voice output.

[0025] Further systems, methods, and devices for translating an objective from a first

communication to a second communication are contemplated. A first communication input is received, and a first root content and a first communication characteristic of the first

communication input are determined. An objective of the first communication input is extracted based on the first root content and the first communication characteristic. The first root content and first communication characteristic are translated into a second root content and second communication characteristic, respectively. The second communication characteristic is applied to the second root content to generate a second communication output, such that the second communication output conveys the objective (e.g., with respect to a second language, etc.). The second communication is then conveyed to the user preferring the second language, providing not only translation but the objective of the original communication as well.

[0026] The first communication is typically in a first language, and the second communication is typically in a second language different than the first. In preferred embodiments, translating the first communication into a second communication occurs in real-time during a communication between a user preferring the first language and a user preferring the second language, thereby facilitating at least partially, sometimes completely, overlapping back-and-forth communication. The first communication is generally at least one of a voice, a text, a symbol, a pictorial, or a physical gesture (e.g., sign language, body language, etc.), for example conveyed by a video call, a phone call, a text based chat, or combinations thereof. The first root content is typically at least two grammatical elements and a semantic relationship between the two grammatical elements.

[0027] In some embodiments, overlapping communication is translated in real-time. For example, a third language communication input is translated into a first language communication output at least partially simultaneously with conveying the second communication.

[0028] One should appreciate that the inventive subject matter provides a system or method that translates not only the literal content of a communication but also the objective behind it, such that a user who prefers a different language receives both the words and the intended meaning. Some aspects of the inventive subject matter include conveying the translated output, with vocal characteristics adapted to the second language and culture, in real time or near real time.

[0029] The following discussion provides many example embodiments of the inventive subject matter. Although each embodiment represents a single combination of inventive elements, the inventive subject matter is considered to include all possible combinations of the disclosed elements. Thus if one embodiment comprises elements A, B, and C, and a second embodiment comprises elements B and D, then the inventive subject matter is also considered to include other remaining combinations of A, B, C, or D, even if not explicitly disclosed.

[0030] Figure 1 is a functional block diagram illustrating a distributed data processing environment.

[0031] The term“distributed” as used herein describes a computer system that includes multiple, physically distinct devices that operate together as a single computer system. Fig. 1 provides only an illustration of one implementation and does not imply any limitations with regard to the environments in which different embodiments may be implemented. Many modifications to the depicted environment may be made by those skilled in the art without departing from the scope of the invention as recited by the claims.

[0032] Distributed data processing environment 100 includes computing device 104 and server computer 108, interconnected over network 102. Network 102 can include, for example, a telecommunications network, a local area network (LAN), a wide area network (WAN), such as the Internet, or a combination of the three, and can include wired, wireless, or fiber optic connections. Network 102 can include one or more wired and/or wireless networks that are capable of receiving and transmitting data, voice, and/or video signals, including multimedia signals that include voice, data, and video information. In general, network 102 can be any combination of connections and protocols that will support communications between computing device 104, server computer 108, and any other computing devices (not shown) within distributed data processing environment 100.

[0033] It is contemplated that computing device 104 can be any programmable electronic computing device (e.g., a hand-held device) capable of communicating with various components and devices within distributed data processing environment 100 via network 102. It is further contemplated that computing device 104 can execute machine readable program instructions and communicate with any devices capable of communication wirelessly and/or through a wired connection. As depicted, computing device 104 includes an instance of user interface 106. However, it is contemplated that any electronic device mentioned herein can include an instance of user interface 106.

[0034] User interface 106 provides a user interface to translation engine 110. Preferably, user interface 106 comprises a graphical user interface (GUI) or a web user interface (WUI) that can display one or more of text, documents, web browser windows, user options, application interfaces, and operational instructions. It is also contemplated that user interface 106 can include information, such as, for example, graphics, texts, and sounds that a program presents to a user and the control sequences that allow a user to control a program.

[0035] In some embodiments, user interface can be mobile application software. Mobile application software, or an“app,” is a computer program designed to run on smart phones, tablet computers, and any other mobile devices.

[0036] User interface 106 can allow a user to register with and configure translation engine 110 (discussed in more detail below) to enable a user to access a mixed reality space. It is contemplated that user interface 106 can allow a user to provide any information to translation engine 110.

[0037] Server computer 108 can be a standalone computing device, a management server, a web server, a mobile computing device, or any other computing system capable of receiving, sending, and processing data. It is contemplated that server computer 108 can include a server computing system that utilizes multiple computers as a server system, such as, for example, a cloud computing system. In other embodiments, server computer 108 can be a computer system utilizing clustered computers and components that act as a single pool of seamless resources when accessed within distributed data processing environment 100.

[0038] Database 112 is a repository for data used by translation engine 110. In the depicted embodiment, translation engine 110 resides on server computer 108. However, database 112 can reside anywhere within a distributed data processing environment provided that translation engine 110 has access to database 112. Data storage can be implemented with any type of data storage device capable of storing data and configuration files that can be accessed and utilized by server computer 108. Data storage devices can include, but are not limited to, database servers, hard disk drives, flash memory, and any combination thereof.

[0039] Figure 2 is a schematic of method 210 of extracting objectives from a voice input.

Translation engine 110 receives voice input (step 202). It is contemplated that voice input can include an actual voice input or any other input that represents a communication. In a preferred embodiment, voice input is the actual voice communications of a user. Alternatively, voice inputs can include, but are not limited to, text (e.g., digital, analog, symbolic, etc.) input, visual communication inputs, sign language inputs, or any other form of communicative expression.

[0040] It is further contemplated that voice input can be received using any communication medium available in the art. For example, computing device 104 as depicted in Figure 1 can include, but is not limited to, smart phones, laptop computers, tablet computers, microphones, and any other computing devices capable of receiving a communicative expression. It is further contemplated that the voice input can be transmitted to any one or more components of the distributed data processing environment depicted in Figure 1.

[0041] In one embodiment, translation engine 110 receives a voice input from a user through a personal computing device. In this embodiment, it is contemplated that a user can interface with translation engine 110 via user interface 106. For example, a user can access translation engine 110 through a smart phone application and manipulate one or more parameters associated with translation engine 110. However, computing device 104 may not have a user interface, and the user may be limited to submitting voice input without any additional control via user interface 106 or any other user input interface.

[0042] In an alternative embodiment, translation engine 110 can receive a text input from a user through computing device 104. Based on the content of the message and any other indicators of the intent of the message (e.g., commas, exclamation points, question marks, symbols, emoticons, etc.), translation engine 110 processes any translations with additional context provided by the other indicators of intent.
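The following sketch illustrates one way such textual intent indicators could be collected before translation; the heuristics and function name are hypothetical and are not taken from the application.

    import re

    def text_intent_hints(message: str) -> dict:
        """Collect surface indicators of intent from a text message (illustrative heuristics only)."""
        return {
            "exclamations": message.count("!"),
            "questions": message.count("?"),
            "all_caps_words": len(re.findall(r"\b[A-Z]{2,}\b", message)),
            "positive_emoticons": len(re.findall(r"[:;]-?\)", message)),
            "negative_emoticons": len(re.findall(r":-?\(", message)),
        }

    hints = text_intent_hints("WHERE is my order?! :(")
    print(hints)  # {'exclamations': 1, 'questions': 1, 'all_caps_words': 1, ...}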

[0043] Translation engine 110 analyzes the content of the voice input (step 204). The content of the voice input can include any objective (e.g., contextualized by speaker language, speaker technical field, speaker culture, speaker religion, speaker gender, speaker age, speaker education level, speaker demographic, etc.) or subjective (e.g., speaker defined, e.g., defined respectful, defined happy, defined agreeable, defined pleading, defined sad, defined hostile, etc.) characteristics of the voice input. For example, translation engine 110 can analyze the words spoken by a user, text entered by a user, or combinations thereof. In another example, translation engine 110 can analyze the length of the voice input.

[0044] In an alternative embodiment, the voice input can be an alternative input, such as a text-based or sign-based input, or combinations thereof. For example, translation engine 110 can translate text written (analog or digital) by a user from one language to another. In another example, translation engine 110 can be coupled to a camera to translate sign language to a different language and/or the same language in a different form (e.g., American Sign Language to spoken English). Where a camera is used to capture part of the communication of a user, translation engine 110 also analyzes and interprets body language (e.g., hunched shoulders, crossed arms, nodding head, etc.) or other physical characteristics (e.g., red face, white face, sweating, tired eyes, etc.) of a communicator or user.

[0045] In a preferred embodiment, translation engine 110 analyzes the words spoken by the user and the meaning behind the words. For example, translation engine 110 can analyze a Chinese language voice input and derive the literal meaning of the voice input based on a direct translation of the words spoken, while additionally analyzing the intonation and the pacing of the words. It is further contemplated that translation engine 110 can differentiate between non-communicative sounds in the voice input and actual language. For example, translation engine 110 can identify placeholder words, such as "and like", used in a voice input and omit those words in deriving the meaning of the voice input.

[0046] In another embodiment, translation engine 110 can use machine learning techniques to determine the objective of the voice input specific to a user. For example, translation engine 110 can use a supervised learning classifier to determine which combination of words, pacing, tone, and any other relevant vocal characteristics are associated with sarcasm for a particular user. In a more specific example, translation engine 110 can analyze the vocal characteristics associated with the phrase "I totally hate you" to determine that the phrase is sarcastic rather than a serious expression of hatred. It is also contemplated that a speaker or communicator can specify the intention of a statement, such as indicating that "I totally hate you" is serious or sarcastic.

[0047] In another example, translation engine 110 can use a time series classifier to extract user trends with voice inputs to determine that the particular phrase "Let's grab a drink" refers to non-alcoholic beverages prior to 6:00 PM and alcoholic drinks after 6:00 PM.
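A minimal sketch of the supervised-classifier idea follows, assuming the scikit-learn library; the features, training samples, and labels are invented for illustration, and a real system would learn from a particular user's communication history as described above.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Each sample combines the words with per-user vocal characteristics.
    samples = [
        {"text": "i totally hate you", "pace": "slow", "tone": "flat", "pitch": "low"},
        {"text": "i totally hate you", "pace": "fast", "tone": "sharp", "pitch": "high"},
        {"text": "that was brilliant", "pace": "slow", "tone": "drawn_out", "pitch": "low"},
        {"text": "that was brilliant", "pace": "normal", "tone": "warm", "pitch": "mid"},
    ]
    labels = ["sarcastic", "literal", "sarcastic", "literal"]

    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(samples)
    classifier = LogisticRegression(max_iter=1000).fit(X, labels)

    new_utterance = {"text": "i totally hate you", "pace": "slow", "tone": "flat", "pitch": "low"}
    print(classifier.predict(vectorizer.transform([new_utterance]))[0])  # -> "sarcastic"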

[0048] Translation engine 110 analyzes the vocal characteristics of the voice input (step 206). Vocal characteristics can include, but are not limited to, any identifiable characteristics associated with the voice input. For example, vocal characteristics can include intonation, pacing, pitch, and volume.
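The sketch below computes rough numerical proxies for volume, pitch, and pacing from a raw waveform using only NumPy; the specific features and thresholds are illustrative assumptions, not the application's method.

    import numpy as np

    def vocal_features(samples: np.ndarray, sample_rate: int) -> dict:
        """Very rough proxies for volume, pacing, and pitch from a mono waveform."""
        rms = float(np.sqrt(np.mean(samples ** 2)))                        # loudness proxy
        zero_crossings = int(np.sum(np.abs(np.diff(np.sign(samples))) > 0))  # pacing/noisiness proxy
        # crude fundamental-frequency estimate from the autocorrelation peak
        autocorr = np.correlate(samples, samples, mode="full")[len(samples) - 1:]  # lags 0..N-1
        min_lag = 20                                                        # ignore very short lags
        lag = int(np.argmax(autocorr[min_lag:])) + min_lag
        pitch_hz = sample_rate / lag
        return {"rms_volume": round(rms, 3),
                "zero_crossing_rate": round(zero_crossings / len(samples), 4),
                "estimated_pitch_hz": round(pitch_hz, 1)}

    sr = 16000
    t = np.linspace(0, 0.25, int(sr * 0.25), endpoint=False)
    tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)  # synthetic 220 Hz "voice"
    print(vocal_features(tone, sr))             # estimated pitch comes out near 220 Hz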

[0049] Vocal characteristics are analyzed based on the language (e.g., formal, related to a specific field, dialect, slang, etc.) and culture (e.g., region, race, ethnicity, religion, other demographics, etc.) of the voice input. Translations of the voice input are synthesized based on the corresponding vocal characteristics of the language and culture of the voice output. Vocal characteristics can be defined in any manner available in the art. For example, vocal characteristics can be mined from public databases, taken from private databases, and/or inputted directly by a user to translation engine 110 via a user interface.

[0050] In one embodiment, translation engine 110 analyzes a sound-based voice input based on the intonation, pacing, pitch, and volume of the voice input. For example, translation engine 110 can determine that a voice input from a menacing user has a rising intonation, a lower pitch, a slower speech rate, and increasing loudness over time. In another example, translation engine 110 can determine that a voice input from a ten-year-old child has a constant intonation, a higher pitch, a faster speech rate, and consistent loudness over time. In yet another example, translation engine 110 can determine that a voice input from a scared user has a wavering intonation, a higher pitch, a faster speech rate, an increasing number of irregular pauses over time, and a consistent quietness in the voice input.

[0051] In some embodiments, translation engine 110 analyzes a text-based voice input based on the content of the message, the punctuation, pictographs, symbols, and the structure of the text. For example, translation engine 110 can determine that a short message service (SMS) text message includes few, abbreviated sentences ending in exclamation points, a smiling emoji, and words indicating happiness about a particular event. In another example, translation engine 110 can determine that an email-based message includes proper language, long form paragraphs, and business or field related jargon.

[0052] Moreover, translation engine 110 analyzes a visual voice input (e.g., sign language) based on the content of the message, the pacing, and the body language of the speaker. For example, translation engine 110 can analyze sign language and determine that the message includes motivational words, consistent pacing, and non-exaggerated motions. In another example, translation engine 110 can analyze sign language and determine that the message follows a rhythmic pacing, words associated with struggle, and large, exaggerated motions.

[0053] It should be appreciated that, in some applications, translation engine 110 analyzes a combination of audio input, textual input, and visual input. Where the interpreted meaning or intent of multiple forms of communication (e.g., audio, textual, visual) in a communicative statement are in conflict or do not agree, translation engine 110 prioritizes an intended meaning (e.g., based on the greatest level of consistency between the multiple forms) and devalues expressions in conflict with the intended meaning (e.g., ignores a sneeze, a cough, a squint, or an unfortunate typo, rather than interpreting it as aggressive, dismissive, or insulting, etc.). Moreover, where a single element of a communication is not aligned with the rest of the communication (e.g., an errant typo, etc.), translation engine 110 ignores such elements, or at least provides a disclaimer in the translation that such an element is likely an error.
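One possible (hypothetical) way to resolve such cross-modal conflicts is a simple consensus rule, as sketched below; a deployed system could weight modalities differently or attach the disclaimer mentioned above to dissenting elements.

    from collections import Counter

    def resolve_intent(modal_readings: dict) -> tuple:
        """Pick the intent most consistent across modalities; flag dissenting modalities.

        modal_readings maps a modality name ("audio", "text", "visual") to the intent
        label inferred from that modality.
        """
        counts = Counter(modal_readings.values())
        consensus, _support = counts.most_common(1)[0]
        dissenters = [m for m, intent in modal_readings.items() if intent != consensus]
        return consensus, dissenters

    intent, conflicts = resolve_intent({"audio": "friendly", "text": "friendly", "visual": "hostile"})
    print(intent)     # friendly
    print(conflicts)  # ['visual'] -> devalued or flagged with a disclaimer in the translation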

[0054] Translation engine 110 extracts one or more objectives associated with the voice input content and vocal characteristics (step 208). Objectives can include any purpose behind the message. For example, objectives can be extracted based on the content of the message, characteristics of the message recipient, and the auditory characteristics of the message.

[0055] In one embodiment, translation engine 110 extracts one or more objectives associated with a verbal input with associated vocal characteristics. Continuing a first example in step 206, translation engine 110 can determine that the voice input from the ten-year-old child has the objective of explaining an exciting occurrence during the child's school day. Continuing a second example in step 206, translation engine 110 can determine that the voice input and characteristics of the scared user have the objective of conveying a warning about a hazard and requesting assistance regarding the hazard.

[0056] Moreover, translation engine 110 extracts one or more objectives associated with a text-based input with associated vocal characteristics in text form. Continuing a first example in step 206, translation engine 110 can determine that the text message has the objective of conveying happiness and excitement about a forthcoming family vacation. Continuing a second example in step 206, translation engine 110 can determine that the email has the objective of confirming plans for a meeting to discuss a potential merger between two large corporations.

[0057] In some embodiments, translation engine 110 extracts one or more objectives associated with visual voice inputs. Continuing a first example in step 206, translation engine 110 can determine that the sign language including motivational words has the objective of offering support for individuals who have recently lost their ability to speak. Continuing a second example in step 206, translation engine 110 can determine that the sign language with rhythmic pacing has the objective of translating the lyrics of a rap performer into sign language for a deaf audience.

[0058] In some embodiments, translation engine 110 directly asks the user a question or requests user input for a voice input. For example, translation engine 110 can directly ask a user whether the statement that they just said was sarcastic. In another example, translation engine 110 can ask the user what the context of their statement will be prior to the user providing a voice input. Further, the user can make affirmative indications of intent, context, or tone when making a communicative statement (e.g., audibly say the statement is sarcastic, textually indicate the statement is sarcastic, physically gesture that the statement is sarcastic, etc.).

[0059] Further, in some applications translation engine 110 analyzes a combination of audio input, textual input, and visual input, and disregards or at least provides warning where the context, intent, or connotation of an element of one or more inputs is in disagreement with consensus context, intent, or connotation of the remaining inputs of a communication.

[0060] Figure 3 depicts method 310 for translating a voice input and extracted objectives to a different language with different vocal characteristics. Translation engine 110 determines a desired translation output (step 302). A desired translation output can comprise any one or more expressions of the voice input and extracted objectives in a different form. For example, a desired translation output can be any one or more of a language, a physical expression (i.e., sign language), a picture, and a text-based message.

[0061] The desired translation output can be determined manually and/or automatically. For example, translation engine 110 can automatically detect the voices of an American woman and a Japanese man and, thereby, determine that the desired translation output will be English-to-Japanese and vice versa.
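A sketch of automatically choosing translation directions from detected speaker languages follows; detect_language is a stand-in for a real spoken-language-identification model, and the speaker identifiers are invented for illustration.

    def detect_language(audio_segment: str) -> str:
        """Stand-in for a spoken language identification model."""
        known = {"voice_of_american_woman": "en", "voice_of_japanese_man": "ja"}
        return known.get(audio_segment, "unknown")

    def translation_directions(speakers: list) -> list:
        """Derive the required translation pairs from the detected speaker languages."""
        languages = {detect_language(s) for s in speakers}
        return [(src, dst) for src in languages for dst in languages if src != dst]

    print(translation_directions(["voice_of_american_woman", "voice_of_japanese_man"]))
    # [('en', 'ja'), ('ja', 'en')] (order may vary)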

[0062] Translation engine 110 determines translated voice input content (step 304). Translated voice input content includes translations in any translation medium. For example, translation mediums can include text-based translations, speech-based translations, and pictographic translations. In an exemplary embodiment, translated voice input content is a language translation from one language to another, different language. For example, translated voice input content can be a translation of the phrase "Why is my order delayed?" into the equivalent phrase in Russian.

[0063] Translated voice input content is not always a direct translation. In situations where a literal translation does not make sense in a particular language, translation engine 110 can determine an equivalent phrase. For example, the idiom "It's raining cats and dogs," which is understandable as an idiom in English, can be translated to "It is raining very heavily" in Japanese. In another example, the phrase "He's a know-it-all" in American English can be translated to "He's a know-all" when translated to British English.
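The equivalent-phrase behavior can be pictured as an idiom lookup applied before literal translation, as in the hypothetical sketch below; the table entries simply restate the examples above.

    # Idioms that should not be translated word-for-word, keyed by (source, target) language.
    IDIOM_EQUIVALENTS = {
        ("en-US", "ja"): {"it's raining cats and dogs": "it is raining very heavily"},
        ("en-US", "en-GB"): {"he's a know-it-all": "he's a know-all"},
    }

    def prepare_for_translation(phrase: str, source: str, target: str) -> str:
        """Swap in an equivalent phrase before literal translation when an idiom is found."""
        table = IDIOM_EQUIVALENTS.get((source, target), {})
        return table.get(phrase.lower().strip(), phrase)

    print(prepare_for_translation("It's raining cats and dogs", "en-US", "ja"))
    # -> "it is raining very heavily", which can then be translated literally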

[0064] Translation engine 110 determines translated vocal characteristics (step 306). Translated vocal characteristics include any vocal characteristics specific to the translated language that are used to help convey a message. It is further contemplated that the translated vocal characteristics are specific to the cultural background associated with the translation.

[0065] Vocal characteristics associated with particular emotions may not directly correlate between cultures. In one embodiment, translation engine 110 converts a voice input and associated vocal characteristics to a translation with different vocal characteristics than the original voice input to maintain a consistent message.

[0066] For example, a phrase spoken in anger in a first language can be inputted as the phrase "I'm so angry!" with a rising inflection, a higher average volume, and a rising pitch over time. Though the voice input and vocal characteristics might indicate anger in a first culture associated with the original voice input, the vocal characteristics of the first culture may not align with the intended message in a second culture. The same emotions may be conveyed via different vocal characteristics depending on the culture. For example, anger may be expressed with a higher volume and ascending pitch in the first culture, but expressed with a lower volume, lower pitch, and descending pitch over time in the second culture.

[0067] As such, translation engine 110 can advantageously convert the vocal characteristics of the translated phrase to better reflect the message intended by the original phrase. Translation engine 110 can apply the converted vocal characteristics to the translated phrase to convey anger in the second culture.
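A minimal sketch of such a conversion follows; the culture names and prosody values are hypothetical and restate the anger example above, the point being that the emotion label is preserved while its acoustic realization is swapped for the target culture.

    # How the same emotion might be realized acoustically in two (hypothetical) cultures.
    PROSODY_BY_CULTURE = {
        "culture_a": {"anger": {"volume": "high", "pitch_trend": "ascending", "pace": "fast"}},
        "culture_b": {"anger": {"volume": "low", "pitch_trend": "descending", "pace": "measured"}},
    }

    def convert_prosody(emotion: str, source_culture: str, target_culture: str) -> dict:
        """Keep the emotion, swap its acoustic realization for the target culture."""
        _detected = PROSODY_BY_CULTURE[source_culture][emotion]    # characteristics found in the input
        return PROSODY_BY_CULTURE[target_culture][emotion]         # characteristics applied to the output

    print(convert_prosody("anger", "culture_a", "culture_b"))
    # {'volume': 'low', 'pitch_trend': 'descending', 'pace': 'measured'}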

[0068] Translation engine 110 synthesizes a translation with converted vocal characteristics (step 308). Translation engine 110 synthesizes the translation using the converted voice input content and applying translated vocal characteristics to convey the original meaning of the voice input content and context.
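One plausible way to apply converted vocal characteristics at synthesis time is to emit SSML prosody markup, which many text-to-speech engines accept; the sketch below is an assumption about integration, not a method recited in the application.

    def to_ssml(translated_text: str, prosody: dict) -> str:
        """Wrap translated text in an SSML prosody element so a TTS engine applies the
        converted vocal characteristics during synthesis."""
        return (
            '<speak><prosody rate="{rate}" pitch="{pitch}" volume="{volume}">{text}</prosody></speak>'
            .format(rate=prosody.get("rate", "medium"),
                    pitch=prosody.get("pitch", "medium"),
                    volume=prosody.get("volume", "medium"),
                    text=translated_text)
        )

    print(to_ssml("Watashi wa totemo okotte imasu", {"rate": "slow", "pitch": "low", "volume": "soft"}))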

[0069] Translation engine 110 outputs a translation (step 309). Objectives can comprise any combination of characteristics that convey meaning.

[0070] Figure 4 depicts a block diagram of components of the server computer executing translation engine 110 within the distributed data processing environment of Fig. 1. Figure 4 is not limited to the depicted embodiment. Any modification known in the art can be made to the depicted embodiment.

[0071] In one embodiment, the computer includes processor(s) 404, cache 414, memory 406, persistent storage 408, communications unit 410, input/output (I/O) interface(s) 412, and communications fabric 402. Communications fabric 402 provides a communication medium between cache 414, memory 406, persistent storage 408, communications unit 410, and I/O interface 412. Communications fabric 402 can include any means of moving data and/or control information between computer processors, system memory, peripheral devices, and any other hardware components.

[0072] Memory 406 and persistent storage 408 are computer readable storage media. As depicted, memory 406 can include any volatile or non-volatile computer storage media. For example, volatile memory can include dynamic random access memory and/or static random access memory. In another example, non-volatile memory can include hard disk drives, solid state drives, semiconductor storage devices, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, and any other storage medium that does not require a constant source of power to retain data.

[0073] In one embodiment, memory 406 and persistent storage 408 are random access memory and a hard drive hardwired to device 104, respectively. For example, device 104 can be a computer executing the program instructions of translation engine 110 communicatively coupled to a solid state drive and DRAM. In some embodiments, persistent storage 408 is removable. For example, persistent storage 408 can be a thumb drive or a card with embedded integrated circuits.

[0074] Communications unit 410 provides a medium for communicating with other data processing systems or devices, including data resources used by device 104. For example, communications unit 410 can comprise multiple network interface cards. In another example, communications unit 410 can comprise physical and/or wireless communication links. It is contemplated that translation engine 110, database 112, and any other programs can be downloaded to persistent storage 408 using communications unit 410.

[0075] In a preferred embodiment, communications unit 410 comprises a global positioning satellite (GPS) device, a cellular data network communications device, and a short-to-intermediate-distance communications device (e.g., Bluetooth®, near-field communications, etc.). It is contemplated that communications unit 410 allows device 104 to communicate with other devices associated with other users.

[0076] Display 418 is contemplated to provide a mechanism to display information from translation engine 110 through device 104. In preferred embodiments, display 418 can have additional functionalities. For example, display 418 can be a pressure-based touch screen or a capacitive touch screen. In yet other embodiments, display 418 can be any combination of sensory output devices, such as, for example, a speaker that communicates information to a user and/or a vibration/haptic feedback mechanism. For example, display 418 can be a combination of a touch screen in the dashboard of a car, a voice command-based communication system, and a vibrating bracelet worn by a user to communicate information through a series of vibrations.

[0077] It is contemplated that display 418 does not need to be physically hardwired components and can, instead, be a collection of different devices that cooperatively communicate information to a user.

[0078] Figure 5 depicts communication network 500, in which communication platform 510 hosts translation service 520. Users 532, 534, 536, and 538 each participate in a group communication via platform 510, both sending communication to and receiving communication from the other users. Communication platform 510 is capable of conveying communication in one or more mediums, for example video with a transcript of verbal statements, verbal statements with accompanying private message or chat message features, as well as file or document sharing.

[0079] Where communication from one or more users is in a language different than a preferred language of another user, as is often the case, translation service 520 translates each incoming stream of communication into the preferred language of each user (as may be necessary), preferably in real-time or at least near real-time. While translation service 520 can prioritize a primary communication medium (e.g., dedicates resources first to translating verbal statements in real-time, then to translating shared files, then to translating chat messages, then to translating private messages, etc.), in preferred embodiments translation service 520 leverages cloud resources (e.g., AWS, etc.) to translate all formats of inbound communication into each respective user’s preferred language in absolute real-time.

[0080] For example, user 532 is an English speaker and prefers English, user 534 is a German speaker and prefers German, user 536 is a Japanese speaker and prefers Japanese, and user 538 is a French speaker and prefers French. From the perspective of user 532, whenever any other user communicates with user 532 or the group (e.g., in German, Japanese, French, or another language, etc.), user 532 receives only English translations of each communication in real-time. Likewise, whenever user 532 communicates with another user or the group, each other user receives user 532's communication only in that user's preferred language. Moreover, as platform 510 fields multiple formats of communication from multiple users at least partially simultaneously, and in some cases substantially simultaneously, translation service 520 receives the multiple formats and different languages of communication from each user, and outputs the communications in each user's respective preferred language in real-time. Such a system enables users with multiple different preferred languages to communicate with each other sequentially, partially simultaneously, or substantially simultaneously in real-time.
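The per-user fan-out described here can be sketched as follows; the user identifiers, language codes, and translate_stub placeholder are illustrative assumptions standing in for the full translation pipeline.

    PREFERRED_LANGUAGE = {"user_532": "en", "user_534": "de", "user_536": "ja", "user_538": "fr"}

    def translate_stub(message: str, source: str, target: str) -> str:
        """Placeholder for the objective-preserving translation pipeline described above."""
        return message if source == target else f"[{source}->{target}] {message}"

    def broadcast(sender: str, message: str) -> dict:
        """Fan a sender's message out to every other participant in that participant's language."""
        source = PREFERRED_LANGUAGE[sender]
        return {
            user: translate_stub(message, source, target)
            for user, target in PREFERRED_LANGUAGE.items()
            if user != sender
        }

    print(broadcast("user_532", "Shall we start the meeting?"))
    # user_534 sees a German rendering, user_536 Japanese, user_538 French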

[0081] In some embodiments, communication (e.g., a live phone call, a live video call, etc.) between one or more users is substantially sequential (e.g., A communicates, followed by B's communication, followed by C's communication, etc.). However, it is also contemplated that communication between one or more users (e.g., up to 3, 4, 5, 10, 15, or 50) is at least partially simultaneous (e.g., A's communication is partially interrupted or overlapped by B's or C's communication), or in some cases completely cacophonous (e.g., A, B, C, or others interrupting and attempting to communicate over each other, etc.). As communication between users becomes more simultaneous, it is critical that systems, methods, and apparatus of the inventive subject matter analyze communications, extract objectives, apply the objectives to translations of the communications, and provide the translation in real-time (e.g., less than 1 second lag, less than 0.5s, 0.1s, 0.05s, 0.01s, or 0.001s lag, etc.).

[0082] For example, as user A communicates (e.g., voice, text, visual gesture, etc.) in its native language, the communication is analyzed, an objective is extracted, the objective is applied to a translation to B or C (or both) in their native language, and the translation is delivered/streamed to B or C (or both) in real-time or near real-time (e.g., within 1%, 2%, 3%, or 5% of real-time). While A is communicating, B receives the stream of A's translated communication in real-time or near real-time, and communicates an interjection ("Could you repeat that?"), interruption ("You've had plenty of time, it's my turn to talk"), or comment ("I agree") to A's communication. A immediately (or in near real-time) receives B's communication and is able to respond or continue with A's ongoing statement. To wit, systems, methods, and apparatus of the inventive subject matter allow users to communicate in their native language with other users of a foreign language, and simultaneously receive instant/overlapping feedback/responsive communications originating in a foreign language but received by the user in the user's own language.

[0083] To facilitate real-time translations of the inventive subject matter, in some embodiments translation engines and translation databases of the inventive subject matter are native to a user’s device. In such embodiments, a user’s communication is analyzed, an objective is extracted, the objective is applied to one or more translations of the user’s communication (e.g., English to French, German, and Japanese, etc.), and each translation is sent to the respective native language user from the originating user’s device. Such embodiments minimize lag time associated with first transmitting communications to a networked server before conveying the translation to the intended receiving user.

[0084] It is further contemplated that systems, methods, and apparatus of the inventive subject matter select between translation processes on the user’s local device or translation processes on a networked server based on hardware and network environments of each user. For example, where a user has a device with sufficient or superb hardware characteristics (e.g., memory, processor, electrical power, etc.) but sub-par or insufficient network characteristics (poor signal, slow upload, high latency, high packet loss, etc.), it is preferred that outgoing translation processes occur on the user’s device and are then conveyed to each respective recipient. In contrast, where a user has a device with poor hardware characteristics (low battery, slow processor, low memory, etc.) but superb network characteristics (strong signal, high upload, low latency, low packet loss, etc.), it is preferred the translation processes occur on a networked server, with translations subsequently provided to each respective user. Such determinations can be user specified or defaulted by the translation system.
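A hypothetical sketch of that local-versus-server decision follows; the thresholds are invented for illustration and, as noted above, could be user specified or set by default.

    def choose_translation_site(device: dict, network: dict) -> str:
        """Decide whether to translate on the local device or on a networked server.

        Thresholds are illustrative; a deployed system would tune or expose them to the user.
        """
        device_capable = (device["battery_pct"] > 20
                          and device["free_memory_mb"] > 512
                          and device["cpu_score"] >= 5)
        network_capable = (network["latency_ms"] < 100
                           and network["upload_mbps"] > 2
                           and network["packet_loss_pct"] < 1)
        if device_capable and not network_capable:
            return "local_device"
        if network_capable and not device_capable:
            return "networked_server"
        return "local_device" if device_capable else "networked_server"

    print(choose_translation_site(
        {"battery_pct": 80, "free_memory_mb": 2048, "cpu_score": 8},
        {"latency_ms": 250, "upload_mbps": 0.5, "packet_loss_pct": 3},
    ))  # -> local_device (capable hardware, poor network)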

[0085] The same applies to inbound translation processes. For example, where users in networked communication have local devices with capable hardware, it is advantageous for each user’s device to perform translation processes, and then broadcast recipient-language translations to each respective recipient. In such embodiments, each user’s local device is responsible for translating outgoing messages, and merely receives inbound messages in the local user’s respective language. Such embodiments facilitate real-time or near real-time partially or substantially overlapping communications between parties using disparate languages yet communicated to each party in its respective language. Such embodiments further allow a user to fine-tune how expressions, intonations, or pronunciations specific to the user (e.g., lisp, stutter, slur, etc.) are interpreted and translated by the translation system and process.

[0086] The inverse can also be applied. For example, each user can simply broadcast their native language communication, which is then received and translated into the local user’s native language by the translation system or process on the local user device. Combinations of translation systems and services for inbound and outbound communication, for example whether performed locally on a receiving or broadcasting device, or proximally on a networked server, are also contemplated.

[0087] It should be appreciated that such systems and methods are applicable to audio, textual, symbolic, pictorial, and bodily/visual communication, separately or in combination. For example, it is advantageous for video or phone conferences that not only translated verbal, signed, or bodily communication be received in real-time or near real-time by each user audibly in its native language, but that such communication is accompanied by a live written transcript in the user’s native language.

[0088] It is worth specifically noting that systems, methods, and devices of the inventive subject matter not only convey live translated video or audio communication between users, but further convey live written communications simultaneously with such video or audio communication. For example, during a video or phone call between multiple users, with each party receiving audible or visual communication translated to the user's local language, the inventive system further translates written/text communications between some or all of the users. Such innovation enables individual users in a group of users on a conference video or call to hold private discussions between two translated languages.

[0089] In some embodiments, the translation system and processes described are added to or used in conjunction with an existing platform. For example, where a group of users in networked communication use a common platform (e.g., Zoom™, Skype™, etc.), translation systems, services, or processes of the inventive subject matter are used in conjunction with the platform, for example by a plug-in or extension. In some embodiments, the system or service is installed on the user’s local device and intercepts communications to/from the platform, within the platform, or between the platform and the user display (e.g., visual display, speaker, etc.) to perform translation processes. The system or service can also be network or cloud based, performing translation processes on the platform between users within the network.

[0090] It should be noted that while the description is drawn to a computer-based translation system, various alternative configurations are also deemed suitable and may employ various computing devices including servers, interfaces, systems, databases, engines, controllers, or other types of computing devices operating individually or collectively. One should appreciate the computing devices comprise a processor configured to execute software instructions stored on a tangible, non-transitory computer readable storage medium (e.g., hard drive, solid state drive, RAM, flash, ROM, etc.). The software instructions preferably configure the computing device to provide the roles, responsibilities, or other functionality as discussed herein with respect to the disclosed apparatus. In especially preferred embodiments, the various servers, systems, databases, or interfaces exchange data using standardized protocols or algorithms, possibly based on HTTP, HTTPS, AES, public-private key exchanges, web service APIs, known financial transaction protocols, or other electronic information exchanging methods. Data exchanges preferably are conducted over a packet-switched network, the Internet, LAN, WAN, VPN, or other type of packet-switched network.

[0091] It should be apparent to those skilled in the art that many more modifications besides those already described are possible without departing from the inventive concepts herein. The inventive subject matter, therefore, is not to be restricted except in the scope of the appended claims. Moreover, in interpreting both the specification and the claims, all terms should be interpreted in the broadest possible manner consistent with the context. In particular, the terms "comprises" and "comprising" should be interpreted as referring to elements, components, or steps in a non-exclusive manner, indicating that the referenced elements, components, or steps may be present, or utilized, or combined with other elements, components, or steps that are not expressly referenced. Where the specification or claims refer to at least one of something selected from the group consisting of A, B, C .... and N, the text should be interpreted as requiring only one element from the group, not A plus N, or B plus N, etc.