Title:
AN AUDIO-TO-TACTILE CONVERTER FOR SPEECH AND SOUND ACCESS
Document Type and Number:
WIPO Patent Application WO/2021/076094
Kind Code:
A1
Abstract:
Systems and methods are provided for translating sounds, e.g., vowel sounds of speech, into two-dimensional patterns of haptic stimulus. An audio signal or information determined therefrom (e.g., a spectrum) is applied to an artificial neural network or other trained encoder to generate a location in a two-dimensional vowel space that represents the input audio signal. A two-dimensional array of haptic actuators is then operated to provide haptic stimulus at the determined location on a user's skin. The magnitude of the provided stimulus is determined based on the overall energy in the input sound signal, e.g., an energy within a vowel-related band of frequencies. Using a trained encoder to map the input audio signal into a two-dimensional haptic stimulus provides an easily understood stimulus related to the identity of detected vowels or other speech sounds over time, allowing the user to more easily detect and interpret speech.

Inventors:
GETREUER PASCAL (US)
LYON RICHARD (US)
Application Number:
PCT/US2019/056103
Publication Date:
April 22, 2021
Filing Date:
October 14, 2019
Assignee:
GOOGLE LLC (US)
International Classes:
G10L21/06; A61F11/04; G06F3/01; G08B6/00; G09B21/00; G10L21/16; G10L25/18; G10L25/21; G10L25/30
Foreign References:
US20180300999A12018-10-18
US20190121439A12019-04-25
US5035242A1991-07-30
Attorney, Agent or Firm:
RELLINGER, Benjamin, A. (US)
Claims:
CLAIMS

We claim:

1. A method comprising: obtaining an audio sample; determining, based on a first segment of the audio sample, a first power in a first band of frequencies, wherein the first band of frequencies comprises frequencies corresponding to a set of predetermined phonemes; determining, based on the first segment of the audio sample, a first audio spectrum, wherein the first audio spectrum comprises a plurality of spectral components in the first band of frequencies; applying the first audio spectrum to a trained encoder to generate a first location in a two-dimensional phoneme space; and outputting the first power and first location for operation of a two-dimensional array of haptic actuators based on the determined first power and the generated first location, wherein each haptic actuator of the two-dimensional array of haptic actuators corresponds to a respective location within the two-dimensional phoneme space, wherein operation of the two-dimensional array of haptic actuators based upon the first power and first location comprises actuating, at an amplitude based on the first power, at least one of the haptic actuators that corresponds to a location within the two-dimensional phoneme space that is proximate to the first location.

2. The method of claim 1, wherein the set of predetermined phonemes comprises vowel sounds.

3. The method of claim 2, wherein the first band of frequencies spans frequencies between 500 Hertz and 3000 Hertz.

4. The method of any of claims 1-3, further comprising: determining, based on a second segment of the audio sample, a second power in the first band of frequencies; determining, based on the second segment of the audio sample, a second audio spectrum, wherein the second audio spectrum comprises a plurality of spectral components in the first band of frequencies; applying the second audio spectrum to the trained encoder to generate a second location in the two-dimensional phoneme space; and outputting the second power and second location for operation of the two-dimensional array of haptic actuators based on the determined second power and the generated second location, wherein operation of the two-dimensional array of haptic actuators based upon the second power and second location comprises actuating, at an amplitude based on the second power, at least one of the haptic actuators that corresponds to a location within the two-dimensional phoneme space that is proximate to the second location.

5. The method of claim 4, wherein the first segment of the audio sample and the second segment of the audio sample partially overlap.

6. The method of any of claims 1-5, wherein determining the first audio spectrum based on the first segment of the audio sample comprises performing a short-time Fourier transform on the first segment of the audio sample.

7. The method of any of claims 1-6, wherein the trained encoder comprises an artificial neural network.

8. The method of any of claims 1-7, wherein the trained encoder has been trained, using a plurality of training samples, to predict a location in the two-dimensional phoneme space based on an input audio spectrum, wherein each training sample in the plurality of training samples includes an input audio spectrum and a corresponding location in the two-dimensional phoneme space.

9. The method of claim 8, wherein each training sample in the plurality of training samples includes a location in the two-dimensional phoneme space that is one of a set of specified locations in the two-dimensional phoneme space, wherein each of the specified locations in the two-dimensional phoneme space corresponds to a respective monophthong vowel.

10. The method of any of claims 8-9, wherein the locations included in the plurality of training samples are arranged according to the International Phonetic Alphabet vowel diagram.

11. The method of any of claims 1-7, wherein the trained encoder has been trained, using a plurality of training samples, to predict a location in the two-dimensional phoneme space based on an input audio spectrum, wherein each training sample in the plurality of training samples includes an input audio spectrum and a corresponding identity of a monophthong vowel.

12. The method of claim 11, wherein the trained encoder comprises a self-organizing map.

13. The method of any of claims 1-7, wherein the trained encoder has been trained, using a plurality of training samples, to predict a plurality of outputs each corresponding to a respective monophthong vowel based on an input audio spectrum, wherein each training sample in the plurality of training samples includes an input audio spectrum and a corresponding identity of a monophthong vowel, wherein each monophthong vowel is associated with a respective location in the two-dimensional phoneme space, and wherein applying the first audio spectrum to the trained encoder to generate the first location in the two-dimensional phoneme space comprises: generating a first plurality of outputs each corresponding to a respective monophthong vowel based on the first audio spectrum; and determining the first location in the two-dimensional phoneme space based on the first plurality of outputs and the locations in the two-dimensional phoneme space associated with the monophthong vowels.

14. The method of claim 13, wherein each of the first plurality of outputs corresponds to a respective monophthong vowel.

15. The method of any of claims 13-14, wherein determining the first location in the two-dimensional phoneme space based on the first plurality of outputs and the locations in the two-dimensional phoneme space associated with the monophthong vowels comprises determining an average of the locations in the two-dimensional phoneme space associated with the monophthong vowels weighted based on the first plurality of outputs.

16. The method of claim 15, wherein determining an average of the locations in the two-dimensional phoneme space associated with the monophthong vowels weighted based on the first plurality of outputs comprises: selecting a specified number of the first plurality of outputs; and determining an average of the locations in the two-dimensional phoneme space associated with the selected outputs based on the corresponding values of the selected outputs.

17. The method of any of claims 1-16, further comprising: determining, based on the first segment of the audio sample, a second power in a second band of frequencies; and outputting the second power for operation of an additional set of one or more haptic actuators based on the determined second power, wherein operation of the additional set of one or more haptic actuators based upon the second power comprises actuating, at an amplitude based on the second power, at least one of the haptic actuators of the additional set of one or more haptic actuators.

18. The method of claim 17, wherein the second band of frequencies comprises frequencies corresponding to glottal pulses.

19. The method of claim 18, wherein the second band of frequencies spans frequencies between 0 Hertz and 500 Hertz.

20. The method of claim 17, wherein the second band of frequencies comprises frequencies corresponding to fricatives.

21. The method of claim 20, wherein the second band of frequencies spans frequencies between 3000 Hertz and 6000 Hertz.

22. The method of any of claims 1-16, further comprising: determining, based on the first segment of the audio sample, a second power in a second band of frequencies, wherein the second band of frequencies comprises frequencies corresponding to glottal pulses; determining, based on the first segment of the audio sample, a third power in a third band of frequencies, wherein the third band of frequencies comprises frequencies corresponding to fricatives; outputting the second power for operation of a second set of one or more haptic actuators based on the determined second power, wherein operation of the second set of one or more haptic actuators based upon the second power comprises actuating, at an amplitude based on the second power, at least one of the haptic actuators of the second set of one or more haptic actuators; and outputting the third power for operation of a third set of one or more haptic actuators based on the determined third power, wherein operation of the third set of one or more haptic actuators based upon the third power comprises actuating, at an amplitude based on the third power, at least one of the haptic actuators of the third set of one or more haptic actuators.

23. The method of claim 22, wherein the second band of frequencies spans frequencies between 0 Hertz and 500 Hertz, and wherein the third band of frequencies spans frequencies between 3000 Hertz and 6000 Hertz.

24. A non-transitory computer-readable medium, having stored thereon program instructions that, upon execution by a controller, cause the controller to perform the method of any one of claims 1-23.

25. A system comprising: a microphone; a two-dimensional array of haptic actuators; a controller; and a non-transitory computer-readable medium, having stored thereon program instructions that, upon execution by the controller, cause the controller to perform the method of any one of claims 1-23.

26. The system of claim 25, wherein each haptic actuator of the two-dimensional array of haptic actuators comprises at least one of a vibrator, a piezoelectric actuator, a pneumatic actuator, an electrohaptic stimulator, a solenoid, or an electric motor.

Description:
AN AUDIO-TO-TACTILE CONVERTER FOR SPEECH AND SOUND ACCESS

BACKGROUND

[0001] A variety of people are affected by hearing loss. This can include individuals who are deaf from birth, individuals who experience illness or trauma leading to lost or reduced hearing, individuals who experience gradual hearing loss due to internal or external processes, or individuals who exhibit or experience lost or diminished hearing for some other reason. Some individuals may be assisted in hearing, and in detecting and interpreting speech, through the use of hearing aids or other assistive technologies. However, for individuals who are fully deaf, due to mechanical or neurological causes, such technologies may not be able to restore the ability to perceive speech.

[0002] The ability to detect and interpret speech by fully deaf individuals may be partially restored through lip reading. However, lip reading is a difficult skill to learn, depends on the speaker being visible and visibly enunciating their words, and may not provide the ability to completely and unambiguously perceive and interpret speech. Additionally or alternatively, fully deaf individuals or others who experience difficulty perceiving and interpreting speech may rely on sound-to-touch assistive devices. Such devices detect sound (e.g., speech sound) and provide haptic stimuli related to the detected sound (e.g., touch stimuli via the operation of one or more vibrators or other haptic actuators).

SUMMARY

[0003] For persons born deaf or hard of hearing, or who later lost some or all of their capacity to hear, it can be difficult to detect and interpret speech. While individuals may learn to lip read in order to partially compensate for full or partial deafness, lip reading works best when the speaker remains visible and enunciates his or her speech. Even under ideal circumstances, it can be difficult to unambiguously interpret speech via lip reading, since certain sounds (e.g., different vowel sounds) can be visually ambiguous.

[0004] To compensate for this, an assistive device can be provided that detects sounds (e.g., speech sounds) and converts those sounds into haptic stimuli. For example, a spectrum of the incoming sound could be determined and the determined spectrum could be presented as touch stimuli using a linear array of vibrators, electrohaptic stimulating electrodes, or some other haptic actuators. In another example, the vowel-related content of the detected sound (e.g., sound within a range of frequencies associated with vowel sounds) could be extracted and used to provide haptic stimulus, e.g., by identifying which vowels are represented in the sound and providing corresponding stimuli. However, the method chosen for mapping the sound input into haptic stimulus can result in difficulty interpreting the stimulus as part of speech. For example, haptic stimulus provided, via a linear array of haptic actuators, to indicate the spectrum of detected sound can be difficult for a user to parse into the corresponding vowel(s), words, or other parts of speech.

[0005] To improve the interpretability of sound-based haptic stimulus, the input sound (e.g., a spectrum of the input sound) could be mapped, using an artificial neural network or other trained encoder, into a two-dimensional phoneme space. Speech sounds (e.g., vowel sounds), when present in the input sound, could be mapped to respective locations within the two-dimensional phoneme space, and a two-dimensional array of haptic actuators could then be operated to provide a haptic stimulus at the mapped location in the two-dimensional phoneme space.

[0006] Mapping the input sound to a stimulus location that varies over time within the two-dimensional phoneme space provides a user interface that represents the speech content (e.g., the identity of vowel sounds or other parts of speech within the speech content) and that allows a user to perceive sounds corresponding to speech content that they may otherwise not be able to perceive and/or interpret. For example, a monophthong vowel sound could result in a stimulus location that is static, at a single location within the two-dimensional phoneme space, over time. In another example, a diphthong vowel could result in motion of the stimulus location over time as the sound transitions between the respective locations of the two constituent monophthong vowels of the diphthong vowel.

[0007] Such a mapping may be easier to interpret than alternative schemes since the two-dimensional nature of the phoneme space lends itself to presentation of haptic stimuli across the surface of skin. Additionally, use of a trained encoder that maps input sound to locations in a phoneme space can provide a variety of improvements relative to alternative stimulus mapping schemes (e.g., relative to providing a set of haptic stimuli, each corresponding to whether a respective monophthong vowel is detected in the input sound). For example, a user can more easily interpret difficult-to-classify sounds, as the location of the provided stimulus may be at a single intermediate location between locations corresponding to vowels potentially represented within the input sound; the user will thus be apprised of both the identity of the possible vowels and the fact that there is ambiguity in the input sound between the two possible vowels. Additionally, continuously mapping input sound to locations in a two-dimensional phoneme space allows transitions over time within the input sound (e.g., transitions from one monophthong vowel to another during expression of a diphthong or triphthong vowel) to be mapped to corresponding motions over time within the two-dimensional phoneme space.

[0008] An encoder trained to map an input sound (e.g., to map a spectrum of the input sound) to such haptic stimulus could include an artificial neural network (ANN) or some other type of trained machine learning element or elements. The ANN or other trained element of the trained encoder could be trained to provide an output location in the two-dimensional phoneme space directly. Alternatively, an element of the trained encoder could output a set of outputs corresponding to respective possible monophthong vowels or other parts of speech, and these outputs could then be combined to generate an output location in the two-dimensional phoneme space.

[0009] To provide additional information via the haptic stimulus, the magnitude of the provided stimulus could be related to the power present in the input sound. For example, the location of a provided haptic stimulus could correspond to an output location in a two-dimensional vowel space determined from the input sound, while the magnitude of the provided haptic stimulus could be determined based on a power present in the input sound within a range of frequencies corresponding to vowel content of the sound. This amplitude-modulation of the provided stimulus can provide additional information about speech in the input sound while also acting to gate the stimulus, preventing erroneous or irritating stimulus from being provided when no speech is present in the input sound.

[0010] Accordingly, an aspect of the present disclosure relates to a method including: (i) obtaining an audio sample; (ii) determining, based on a first segment of the audio sample, a first power in a first band of frequencies, wherein the first band of frequencies comprises frequencies corresponding to a set of predetermined phonemes; (iii) determining, based on the first segment of the audio sample, a first audio spectrum, wherein the first audio spectrum comprises a plurality of spectral components in the first band of frequencies; (iv) applying the first audio spectrum to a trained encoder to generate a first location in a two-dimensional phoneme space; and (v) outputting the first power and first location for operation of a two-dimensional array of haptic actuators based on the determined first power and the generated first location, wherein each haptic actuator of the two-dimensional array of haptic actuators corresponds to a respective location within the two-dimensional phoneme space, wherein operation of the two-dimensional array of haptic actuators based upon the first power and first location comprises actuating, at an amplitude based on the first power, at least one of the haptic actuators that corresponds to a location within the two-dimensional phoneme space that is proximate to the first location.

[0011] In some embodiments, the set of predetermined phonemes comprises vowel sounds. In such embodiments, the first band of frequencies can span frequencies between 500 Hertz and 3000 Hertz.

[0012] The method may additionally include: (vi) determining, based on a second segment of the audio sample, a second power in the first band of frequencies; (vii) determining, based on the second segment of the audio sample, a second audio spectrum, wherein the second audio spectrum comprises a plurality of spectral components in the first band of frequencies; (viii) applying the second audio spectrum to the trained encoder to generate a second location in the two-dimensional phoneme space; and (ix) outputting the second power and second location for operation of the two-dimensional array of haptic actuators based on the determined second power and the generated second location, wherein operation of the two-dimensional array of haptic actuators based upon the second power and second location comprises actuating, at an amplitude based on the second power, at least one of the haptic actuators that corresponds to a location within the two-dimensional phoneme space that is proximate to the second location. In such embodiments, the first segment of the audio sample and the second segment of the audio sample may partially overlap.

[0013] In some embodiments, determining the first audio spectrum based on the first segment of the audio sample may include performing a short-time Fourier transform on the first segment of the audio sample.

[0014] In some embodiments, the trained encoder comprises an artificial neural network.

[0015] In some embodiments, the trained encoder has been trained, using a plurality of training samples, to predict a location in the two-dimensional phoneme space based on an input audio spectrum, wherein each training sample in the plurality of training samples includes an input audio spectrum and a corresponding location in the two-dimensional phoneme space. In such embodiments, each training sample in the plurality of training samples may include a location in the two-dimensional phoneme space that is one of a set of specified locations in the two-dimensional phoneme space, wherein each of the specified locations in the two-dimensional phoneme space corresponds to a respective monophthong vowel. In such embodiments, the locations included in the plurality of training samples may be arranged according to the International Phonetic Alphabet vowel diagram.

[0016] In some embodiments, the trained encoder has been trained, using a plurality of training samples, to predict a location in the two-dimensional phoneme space based on an input audio spectrum, wherein each training sample in the plurality of training samples includes an input audio spectrum and a corresponding identity of a monophthong vowel. In such embodiments, the trained encoder comprises a self-organizing map.

[0017] In some embodiments, the trained encoder has been trained, using a plurality of training samples, to predict a plurality of outputs each corresponding to a respective monophthong vowel based on an input audio spectrum, wherein each training sample in the plurality of training samples includes an input audio spectrum and a corresponding identity of a monophthong vowel, wherein each monophthong vowel is associated with a respective location in the two-dimensional phoneme space, and wherein applying the first audio spectrum to the trained encoder to generate the first location in the two-dimensional phoneme space comprises: (a) generating a first plurality of outputs each corresponding to a respective monophthong vowel based on the first audio spectrum; and (b) determining the first location in the two-dimensional phoneme space based on the first plurality of outputs and the locations in the two-dimensional phoneme space associated with the monophthong vowels. In such embodiments, each of the first plurality of outputs may correspond to a respective monophthong vowel. In such embodiments, determining the first location in the two-dimensional phoneme space based on the first plurality of outputs and the locations in the two-dimensional phoneme space associated with the monophthong vowels may comprise determining an average of the locations in the two-dimensional phoneme space associated with the monophthong vowels weighted based on the first plurality of outputs. In such embodiments, determining an average of the locations in the two-dimensional phoneme space associated with the monophthong vowels weighted based on the first plurality of outputs may comprise: (a) selecting a specified number of the first plurality of outputs; and (b) determining an average of the locations in the two-dimensional phoneme space associated with the selected outputs based on the corresponding values of the selected outputs.

[0018] In some embodiments, the method may additionally include: (vi) determining, based on the first segment of the audio sample, a second power in a second band of frequencies; and (vii) outputting the second power for operation of an additional set of one or more haptic actuators based on the determined second power, wherein operation of the additional set of one or more haptic actuators based upon the second power comprises actuating, at an amplitude based on the second power, at least one of the haptic actuators of the additional set of one or more haptic actuators. In such embodiments, the second band of frequencies may comprise frequencies corresponding to glottal pulses, e.g., may span frequencies between 0 Hertz and 500 Hertz. Alternatively, the second band of frequencies may comprise frequencies corresponding to fricatives, e.g., may span frequencies between 3000 Hertz and 6000 Hertz.

[0019] In some embodiments, the method may additionally include: (vi) determining, based on the first segment of the audio sample, a second power in a second band of frequencies, wherein the second band of frequencies comprises frequencies corresponding to glottal pulses; (vii) determining, based on the first segment of the audio sample, a third power in a third band of frequencies, wherein the third band of frequencies comprises frequencies corresponding to fricatives; (viii) outputting the second power for operation of a second set of one or more haptic actuators based on the determined second power, wherein operation of the second set of one or more haptic actuators based upon the second power comprises actuating, at an amplitude based on the second power, at least one of the haptic actuators of the second set of one or more haptic actuators; and (ix) outputting the third power for operation of a third set of one or more haptic actuators based on the determined third power, wherein operation of the third set of one or more haptic actuators based upon the third power comprises actuating, at an amplitude based on the third power, at least one of the haptic actuators of the third set of one or more haptic actuators. In such embodiments, the second band of frequencies may span frequencies between 0 Hertz and 500 Hertz, and the third band of frequencies may span frequencies between 3000 Hertz and 6000 Hertz.

[0020] Yet another aspect of the present disclosure relates to a non-transitory computer-readable medium, having stored thereon program instructions that, upon execution by a controller, cause the controller to perform any of the methods above.

[0021] Yet another aspect of the present disclosure relates to a system including: (i) a microphone; (ii) a two-dimensional array of haptic actuators; (iii) a controller; and (iv) a non-transitory computer-readable medium, having stored thereon program instructions that, upon execution by the controller, cause the controller to perform any of the methods above. In such embodiments, each haptic actuator of the two-dimensional array of haptic actuators may include at least one of a vibrator, a piezoelectric actuator, a pneumatic actuator, an electrohaptic stimulator, a solenoid, or an electric motor.

[0022] These as well as other aspects, advantages, and alternatives will become apparent to those of ordinary skill in the art by reading the following detailed description with reference where appropriate to the accompanying drawings. Further, it should be understood that the description provided in this summary section and elsewhere in this document is intended to illustrate the claimed subject matter by way of example and not by way of limitation.

BRIEF DESCRIPTION OF THE FIGURES

[0023] Figure 1A depicts elements of an example system.

[0024] Figure 1B depicts elements of an example trained encoder.

[0025] Figure 1C depicts elements of an example trained encoder.

[0026] Figure 2A depicts an example determined location in a 2D space.

[0027] Figure 2B depicts an example determined location in a 2D space.

[0028] Figure 3 depicts an example mapping of vowels in a 2D space.

[0029] Figure 4A depicts an example use of an actuator array.

[0030] Figure 4B depicts an example use of an actuator array.

[0031] Figure 5A depicts an example signal power over time.

[0032] Figure 5B depicts example magnitudes at which actuators are driven over time.

[0033] Figure 6 is a simplified block diagram showing some of the components of an example computing system.

[0034] Figure 7 is a flowchart of a method.

DETAILED DESCRIPTION

[0035] Examples of methods and systems are described herein. It should be understood that the words “exemplary,” “example,” and “illustrative” are used herein to mean “serving as an example, instance, or illustration.” Any embodiment or feature described herein as “exemplary,” “example,” or “illustrative” is not necessarily to be construed as preferred or advantageous over other embodiments or features. Further, the exemplary embodiments described herein are not meant to be limiting. It will be readily understood that certain aspects of the disclosed systems and methods can be arranged and combined in a wide variety of different configurations.

I. Example Conversion of Audio Data to Tactile Outputs

[0036] Persons with diminished hearing, for example due to gradual or traumatic hearing loss or a congenital condition, may experience difficulty perceiving and interpreting speech or other useful sounds. While learned skills like lip reading may partially compensate in some cases, many individuals can benefit from the use of assistive devices in order to reliably perceive and interpret speech. Such assistive devices may include hearing aids. However, in cases of severe or total hearing loss, hearing aids may be unable to functionally restore the ability to perceive and interpret speech. Devices exist that provide a visual indication of speech in the environment of a user; however, such devices partially obstruct the user’s field of view and/or induce the user to look away from their environment, toward the display, in order to perceive the predicted speech.

[0037] A variety of assistive devices are available that transform detected sounds into haptic stimulus. A haptic stimulus has the benefit of not occluding the user’s vision or requiring the user to wear a device on the user’s face or hands (thus avoiding associated negative social, physical comfort, and/or cosmetic effects). Additionally, haptic stimulus may be provided even to severely or completely hearing-impaired individuals. Such a device may include one or more vibrators, solenoids, motors, heaters, haptic stimulator electrodes, or other haptic actuators configured to provide haptic stimulation to skin. The haptic actuator(s) of such a device may be mounted (e.g., via straps, adhesive, or other means) to skin on the back of a user’s hand, to a user’s arm, back, chest, torso, neck, or some other location where a haptic stimulus can be comfortably and reliably perceived.

[0038] Previous examples of such devices may operate in a variety of ways to convert a detected input sound into haptic stimulus. For example, such a device may convert the detected sounds into a frequency spectrum, and then provide a haptic stimulus to the user representing the determined frequency spectrum. However, these prior methods of operation fail to provide a haptic stimulus that is easy for a user to interpret and/or to use in successfully perceiving and interpreting speech.

[0039] The embodiments described herein provide an improved haptic stimulus based on detected sounds. This improved haptic stimulus is easier for users to interpret and improves accuracy of interpretation of detected speech sounds, e.g., in combination with lip reading. This improved haptic stimulus is generated by separating the detected sound into different part(s) of speech (e.g., vowel sounds, glottal pulses, fricatives) and then providing, based on a separated part of speech (e.g., vowel sounds), haptic stimulus at a corresponding location in a two-dimensional phoneme space (e.g., a two-dimensional vowel space). The magnitude of the provided haptic stimulus may be modulated according to the magnitude of the detected separated part of speech.

[0040] Mapping the detected sounds, within a particular part of speech, to a single location within a two-dimensional phoneme space improves the user’s ability to haptically perceive and distinguish different sounds across the locally two-dimensional surface of the user’s skin. Additionally, mapping the speech sounds to a continuous two-dimensional space can allow the haptic stimulus to easily and intuitively represent ambiguity in the detected sound (e.g., by providing a stimulus location that is between the locations of potential identified vowels, e.g., monophthong vowels), changes in the sound over time (e.g., representing a diphthong vowel as a motion, over time, within the two-dimensional phoneme space from a location of a first constituent monophthong vowel of the diphthong vowel to a location of a second constituent monophthong vowel of the diphthong vowel), or other semantically relevant content of the detected sound. Further, providing stimulus at a single location, rather than multiple locations (e.g., representing multiple different detected vowel sounds), simplifies the process of perception and interpretation on the part of the user.

[0041] This mapping of an input sound to a stimulus location in a two-dimensional phoneme space could include performing some preprocessing (e.g., filtering, transformation into an alternative frequency domain or time domain representation) on the input audio signal and then applying the pre-processed audio signal to an artificial neural network or some other machine learning algorithm in order to map the input audio signal to a stimulus location in the two-dimensional phoneme space. Figure 1A illustrates such a process, wherein an input spectrum 110, determined from a segment of an input audio sample, is applied to a trained encoder 120 in order to generate an output location, in a two-dimensional phoneme space, that can be used to drive a haptic actuator array 130 to provide a stimulus based on the generated output location. The input spectrum 110 may be determined from an input audio sample via a variety of processes, e.g., via short-time Fourier transform. The input spectrum 110 may represent all of the frequencies present in the input audio sample, or may be limited to a range of frequencies of interest (e.g., a range of frequencies between 500 Hertz and 3000 Hertz, or some other range of frequencies corresponding to vowel sounds).
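To make this preprocessing step concrete, the following Python sketch computes a short-time Fourier transform and restricts it to a vowel-related band. The 16 kHz sample rate, frame length, hop size, and 500-3000 Hertz band edges are illustrative assumptions, not values fixed by this disclosure.

```python
# Minimal preprocessing sketch: band-limited spectra from an audio signal.
import numpy as np
from scipy.signal import stft

def vowel_band_spectra(audio, fs=16000, f_lo=500.0, f_hi=3000.0):
    """Return (band frequencies, spectral magnitudes) for an audio signal.

    Columns of the returned magnitude array correspond to successive,
    partially overlapping segments of the input audio sample.
    """
    freqs, _times, Z = stft(audio, fs=fs, nperseg=512, noverlap=384)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return freqs[band], np.abs(Z[band, :])
```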

[0042] The trained encoder 120 could include a variety of elements. In some examples, the trained encoder could include a trained machine learning algorithm that generates a location, in a two-dimensional phoneme space, based on the input audio spectrum 110. Figure 1B illustrates an example of such a trained encoder 120b, which includes a trained artificial neural network 125b that receives, as an input, the input spectrum 110 and generates, as an output, a location in a two-dimensional phoneme space. Such a trained encoder could additionally or alternatively include a decision tree, a regression tree, a forest of decision trees and/or regression trees, a support vector machine, a k-nearest neighbors mapping algorithm, or some other trained machine learning algorithm.
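As a hedged illustration, an encoder like 120b could reduce to a small feedforward network whose final layer has two outputs. The numpy sketch below uses random stand-in weights purely to show the data flow; an actual encoder would obtain W1, b1, W2, and b2 through training on labelled spectra.

```python
# Sketch of an encoder like 120b: spectrum in, 2-D location out.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 64, 32  # assumed spectrum length and hidden width
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(2, n_hidden)), np.zeros(2)  # stand-in weights

def encode(spectrum):
    """Map a length-n_in spectrum to an (x, y) phoneme-space location."""
    h = np.tanh(W1 @ spectrum + b1)  # hidden layer
    return W2 @ h + b2               # two output coordinates
```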

[0043] Alternatively, the trained encoder 120 could include a trained machine learning algorithm that generates a number of outputs, based on the input audio spectrum 110, that are then used to determine a location in a two-dimensional phoneme space. Figure 1C illustrates an example of such a trained encoder 120c, which includes a trained artificial neural network 125c that receives, as an input, the input spectrum 110 and generates a number of outputs (e.g., likelihoods that the input represents each monophthong vowel in a set of monophthong vowels). These outputs are applied to a mapping module 127c that then generates a location in a two-dimensional phoneme space. Such a trained encoder could additionally or alternatively include a decision tree, a regression tree, a forest of decision trees and/or regression trees, a support vector machine, a k-nearest neighbors mapping algorithm, or some other trained machine learning algorithm.

[0044] Such a mapping module 127c could map outputs of a trained machine learning algorithm into a location in a two-dimensional phoneme space in a variety of ways. For example, each output could be associated with a respective location in the two-dimensional phoneme space, and generating a location in the two-dimensional phoneme space via the mapping module 127c could include determining an average of each output’s respective location, weighted by each output’s respective value. This is illustrated by way of example in Figure 2A, which shows the locations (black dots of varying sizes) within a two-dimensional phoneme space 200 that are associated with respective outputs of, e.g., an artificial neural network. The values of the outputs are indicated by the relative size of the dots. Mapping those outputs to a location 210a in the two-dimensional phoneme space 200 includes taking a weighted average of the locations corresponding to the outputs, weighted according to the outputs’ respective values. Each output could correspond, e.g., to a respective monophthong vowel.
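A minimal sketch of such a mapping module, assuming nonnegative outputs that are not all zero, is given below; the per-output locations would be fixed when the system is designed (e.g., per Figure 3).

```python
import numpy as np

def outputs_to_location(outputs, locations):
    """Weighted average of per-output locations, as in Figure 2A.

    outputs:   length-N array of nonnegative encoder outputs.
    locations: (N, 2) array; row i is the phoneme-space location of output i.
    """
    w = np.asarray(outputs, dtype=float)
    locs = np.asarray(locations, dtype=float)
    return (w[:, None] * locs).sum(axis=0) / w.sum()
```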

[0045] In some examples, this mapping could include discarding some of the outputs prior to determining the weighted average or applying some other nonlinear weighting to the outputs’ respective locations. For example, outputs whose values are below a threshold level could be discarded. In another example, the top n outputs (e.g., the top three outputs) could be retained and used to determine the weighted average. This is illustrated by way of example in Figure 2B, which shows the locations (dots of varying sizes) within a two-dimensional phoneme space 200 that are associated with respective outputs of, e.g., an artificial neural network. The values of the outputs are indicated by the relative size of the dots. Mapping those outputs to a location 210b in the two-dimensional phoneme space 200 includes selecting a subset of the outputs (illustrated by the filled dots), and then taking a weighted average of the locations corresponding to the selected outputs, weighted according to the selected outputs’ respective values.

[0046] Such trained encoders could be trained in a variety of ways. The training could be supervised, using a plurality of training samples that include both an input audio sample or information determined therefrom (e.g., an audio spectrum) and a ‘correct’ output. In examples wherein the trained encoder is trained to directly generate locations in the two-dimensional phoneme space, the ‘correct’ outputs of the training samples could be specified locations, in the two-dimensional phoneme space, of phonemes (e.g., vowels) represented in the respective audio samples of the training samples. These locations could be specified according to a user preference, or according to a standard. For example, the two-dimensional phoneme space could be a two-dimensional vowel space, and the locations of the training samples could be specified according to the International Phonetic Alphabet vowel diagram of monophthong vowels. Figure 3 shows an example mapping of the International Phonetic Alphabet vowel diagram onto haptic actuators 300 (open dots) of a two-dimensional array of haptic actuators.
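The top-n selection of [0045] and the vowel-diagram layout of [0046] might be combined as sketched below. The coordinates in IPA_LOCATIONS are rough, hypothetical placements of a few monophthongs (x for backness, y for height); the disclosure specifies the International Phonetic Alphabet vowel diagram as the arrangement but no particular numeric coordinates.

```python
import numpy as np

# Hypothetical coordinates loosely following the IPA vowel diagram.
IPA_LOCATIONS = {
    "i": (0.0, 1.0), "u": (1.0, 1.0),
    "e": (0.1, 0.6), "o": (0.9, 0.6),
    "a": (0.3, 0.0),
}

def top_n_location(outputs, vowels, n=3):
    """Weighted average over only the n largest outputs (Figure 2B)."""
    out = np.asarray(outputs, dtype=float)
    keep = np.argsort(out)[-n:]  # indices of the n largest outputs
    locs = np.array([IPA_LOCATIONS[vowels[i]] for i in keep])
    w = out[keep]
    return (w[:, None] * locs).sum(axis=0) / w.sum()

# Example: an ambiguous /i/-vs-/e/ sound lands between those two vowels.
loc = top_n_location([0.45, 0.02, 0.40, 0.05, 0.08], list(IPA_LOCATIONS))
```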

[0047] Alternatively, the training samples could be labelled with class values (e.g., the identity of monophthong vowels or some other class of phonemes of interest) and the training process could generate a mapping into the two-dimensional phoneme space in an unsupervised manner (e.g., the encoder being trained could include a self-organizing map). This training could include minimizing a cost function that rewards ‘similar’ input samples being mapped to nearby locations in the two-dimensional phoneme space and that penalizes ‘dissimilar’ input samples being mapped to nearby locations in the two-dimensional phoneme space.
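As one concrete possibility for this unsupervised approach, the sketch below uses the third-party minisom package (named here as an assumption; the disclosure does not identify any library), letting the map's two-dimensional grid of cells stand in for the phoneme space.

```python
import numpy as np
from minisom import MiniSom  # assumed third-party dependency

# Placeholder training data: one spectrum per row; real training would
# use spectra of labelled vowel sounds as described above.
spectra = np.random.rand(200, 64)

som = MiniSom(8, 8, input_len=64, sigma=1.5, learning_rate=0.5)
som.train_random(spectra, num_iteration=5000)

# After training, similar spectra map to nearby cells, so the winning
# cell can serve directly as a location in the phoneme space.
row, col = som.winner(spectra[0])
```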

[0048] In examples wherein the trained encoder is trained to generate some other output (e.g., likelihoods that the input audio sample represents each one of a set of monophthong vowels), the ‘correct’ outputs of the training samples could be class values/identities (e.g., the identities of whichever monophthong vowel is represented in each training sample). Such outputs could each be associated with a respective location in the two-dimensional phoneme space (e.g., according to the International Phonetic Alphabet vowel diagram of monophthong vowels) and the stimulus location in the two-dimensional phoneme space could be determined based on the determined outputs and on the set of associated locations in the two-dimensional phoneme space. For example, the output location could be determined based on the location of the highest-valued output, an average of the locations associated with the outputs weighted according to the output values (e.g., as illustrated in Figure 2A), an average of the locations associated with a subset of the outputs weighted according to the output values (e.g., as illustrated in Figure 2B), or based on some other method.

[0049] Output locations in the two-dimensional phoneme space, as well as other information about a haptic stimulus (e.g., a stimulus magnitude determined based on a power in one or more frequency bands of the input audio sample) could be determined repeatedly over time, in order to provide a time-varying haptic stimulus that changes according to changes, over time, in an input audio signal. For example, respective locations in the two-dimensional phoneme space (and/or stimulus magnitudes or other stimulus parameters) could be determined for each of a plurality of segments of an input audio signal over time. Such audio segments could be non-overlapping (e.g., consecutive segments of the input audio signal) or could partially overlap. These output locations (and other determined stimulus parameters) could be filtered in some manner (e.g., lowpass filtered in two dimensions, applied to an inertial filter) prior to being applied to drive haptic actuators in a two-dimensional array of haptic actuators.
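A sketch of the segmentation and smoothing just described, under assumed frame sizes, might look as follows; the exponential filter is one simple stand-in for the lowpass or inertial filtering mentioned above.

```python
import numpy as np

def overlapping_segments(audio, seg_len=512, hop=256):
    """Yield partially overlapping segments of an input audio signal."""
    for start in range(0, len(audio) - seg_len + 1, hop):
        yield audio[start:start + seg_len]

def smooth_locations(locations, alpha=0.3):
    """Exponentially smooth a sequence of (x, y) stimulus locations."""
    smoothed, prev = [], None
    for loc in locations:
        loc = np.asarray(loc, dtype=float)
        prev = loc if prev is None else alpha * loc + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed
```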

[0050] A determined output location in the two-dimensional phoneme space could be applied to drive a two-dimensional array of haptic actuators in a variety of ways. Each of the haptic actuators in the array could be associated with a respective location in the two-dimensional phoneme space, and the relationship between the locations of the haptic actuators and the determined stimulus location could be used to determine which haptic actuators to actuate, and to what magnitude.

[0051] In some examples, only the single haptic actuator having a location closest to the determined stimulus location may be driven. This is illustrated in Figure 4A, which shows a two-dimensional array of haptic actuators 400 (open and filled dots) and first 410a and second 415a determined stimulus locations (corresponding, e.g., to respective first and second segments of an input audio signal). Driving the array of haptic actuators 400 according to the first 410a determined stimulus location includes actuating a first haptic actuator 420a (filled dot) that has a corresponding location that is closest to the first 410a determined stimulus location out of the set of locations corresponding to the haptic actuators. Driving the array of haptic actuators 400 according to the second 415a determined stimulus location includes actuating a second haptic actuator 425a (filled dot) that has a corresponding location that is closest to the second 415a determined stimulus location out of the set of locations corresponding to the haptic actuators.

[0052] In some examples, a set of haptic actuators having locations near the determined stimulus location may be driven. This could include actuating all actuators having locations within a specified threshold distance from the determined stimulus location. In another example, a specified number (e.g., three) of the haptic actuators having locations that are closest to the determined stimulus location may be driven. This is illustrated in Figure 4B, which shows the two-dimensional array of haptic actuators 400 (open and filled dots) and first 410b and second 415b determined stimulus locations (corresponding, e.g., to respective first and second segments of an input audio signal). Driving the array of haptic actuators 400 according to the first 410b determined stimulus location includes actuating a first set of three haptic actuators 420b (filled dots) that have corresponding locations that are closest to the first 410b determined stimulus location out of the set of locations corresponding to the haptic actuators. Each of the driven actuators could be driven to provide the same magnitude stimulus (e.g., a set magnitude, or a magnitude determined based on a power in a specified band of frequencies in the corresponding segment of the input audio signal), or different magnitudes of stimulus (e.g., weighted according to distance from the first 410b determined stimulus location). Driving the array of haptic actuators 400 according to the second 415b determined stimulus location includes actuating a second set of three haptic actuators 425b (filled dots) that have corresponding locations that are closest to the second 415b determined stimulus location out of the set of locations corresponding to the haptic actuators.
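Both driving schemes reduce to a nearest-neighbor query over the actuators' assigned locations, as in this sketch; k=1 corresponds to Figure 4A and k=3 to Figure 4B.

```python
import numpy as np

def nearest_actuators(stimulus_loc, actuator_locs, k=1):
    """Indices of the k actuators whose locations are closest to the
    determined stimulus location (k=1 for Figure 4A, k=3 for Figure 4B)."""
    diffs = np.asarray(actuator_locs, dtype=float) - np.asarray(stimulus_loc)
    return np.argsort(np.linalg.norm(diffs, axis=1))[:k]
```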

[0053] Whatever method is used to map a determined stimulus location onto one or more haptic actuators of an array, the overall magnitude of the stimulus provided via the one or more actuators may be controlled at a specified level. This could include operating all of the one or more actuators at the same specified level or scaling the magnitude of stimulus provided via a set of actuators (e.g., weighted according to distance from the determined stimulus location) such that the total stimulus equals the specified level. Such a specified level could be set to a static level (e.g., a static level that is controllable via a user interface of a device). Alternatively, the specified level could be determined based on some property of the input audio signal.

[0054] In some examples, the specified level could be determined based on a power in a particular band of interest within the input audio sample. For example, the determined location of the haptic stimulus could be a location in a two-dimensional vowel space and the overall magnitude of the stimulus could be determined based on a power in a band of frequencies, in the input audio sample, that includes vowel sounds (e.g., a band of frequencies from 500 Hertz to 3000 Hertz). This has the benefit of providing additional information to a user regarding the magnitude of the vowel sounds (or other sounds of interest) within the input audio signal as well as gating the provided stimulus when no or minimal sound is present in the input audio sample. Additionally, the power in other frequency bands could be determined for other parts of speech, such as glottal pulses (e.g., from 0 Hertz to 500 Hertz) and/or fricatives (e.g., from 3000 Hertz to 6000 Hertz).
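The per-band powers could be computed with a plain FFT periodogram, as sketched below; the band edges mirror the examples above and would be tunable in practice.

```python
import numpy as np

def band_powers(segment, fs=16000):
    """Mean power of a segment in the glottal (0-500 Hz), vowel
    (500-3000 Hz), and fricative (3000-6000 Hz) example bands."""
    spec = np.abs(np.fft.rfft(segment)) ** 2
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)

    def power(lo, hi):
        sel = (freqs >= lo) & (freqs < hi)
        return float(spec[sel].mean())

    return power(0, 500), power(500, 3000), power(3000, 6000)
```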

[0055] Figure 5A shows an example power signal 510 over time. The power signal 510 is determined based on the power within a band of interest within an input audio signal (e.g., a band that contains frequencies corresponding to vowel sounds, glottal pulses, fricatives, or some other speech or other sound of interest). For example, the power signal 510 could represent the power in a vowel-related band of frequencies over a period of time when a person made a diphthong vowel sound from time t1 to time t2 (beginning, at time t1, with a first monophthong vowel and ending, at time t2, with a second monophthong vowel) and a third monophthong vowel sound at time t3. Accordingly, one or more haptic actuators in an array of haptic actuators could be driven, over time, according to the power signal 510. This could be mapped to multiple different actuators over time, according to the determined location, within a two-dimensional phoneme space, of the haptic stimulus over time. This is illustrated in Figure 5B, which shows (a) the magnitude of actuation 520a of a haptic actuator associated with a location proximate to a location associated with the first monophthong vowel, (b) the magnitude of actuation 520b of a haptic actuator associated with a location proximate to a location associated with the second monophthong vowel, and (c) the magnitude of actuation 520c of a haptic actuator associated with a location proximate to a location associated with the third monophthong vowel.

[0056] A device could include multiple different two-dimensional arrays of haptic actuators, corresponding to respective different two-dimensional phoneme spaces for respective different parts of speech (e.g., a first array for vowel sounds, a second array for fricatives, and/or a third array for glottal pulses). Respective locations in the respective different two-dimensional phoneme spaces could be determined, using the methods described herein, and used to drive respective two-dimensional arrays of haptic actuators. Additionally or alternatively, certain parts of speech (e.g., fricatives, glottal pulses) could be represented by one or more haptic actuators in a different manner. For example, a device could include a two-dimensional array of haptic actuators operated, as described elsewhere herein, to provide a haptic stimulus having a location and magnitude corresponding to the identity and power of vowel sounds in an input audio signal. Such a device could include an additional one or more haptic actuators operated to provide a haptic stimulus having a magnitude corresponding to the power of some other speech sounds (e.g., glottal pulses, fricatives).

II. Example Systems

[00155] Computational functions (e.g., functions to obtain samples of audio, to generate spectra and/or power levels based on such audio samples, to determine a location in a two-dimensional vowel or other phoneme space based on such audio information, and/or to operate an array of haptic actuators based on such a determined location) described herein may be performed by one or more computing systems. Such a computing system may be integrated into or take the form of a computing device, such as a mobile phone, tablet computer, laptop computer, wearable audio prosthetic, and/or programmable logic controller. For purposes of example, Figure 6 is a simplified block diagram showing some of the components of an example computing device 600 that may include a microphone 624 and an array of haptic actuators 626. The microphone 624 may include one or more condenser microphones, piezoelectric microphones, MEMS microphones, coil microphones, optical microphones, or otherwise-configured sound-sensitive sensor elements. The array of haptic actuators 626 may include one or more vibrators, solenoids, motors, electrohaptic electrodes, heaters, or other elements configured to provide, to skin of a human, a localized haptic stimulus. The array of haptic actuators 626 may include a plurality of haptic actuators arranged in a square grid, a hexagonal grid, or some other regular or irregular pattern.

[00156] Computing device 600 may be a wearable device or may include one or more wearable components. For example, the computing device 600 may include a mount or other elements configured to maintain the array of haptic actuators 626 in contact with skin of a user.

[00157] By way of example and without limitation, computing device 600 may include a cellular mobile telephone (e.g., a smartphone), an on-ear or in-ear hearing aid, a computer (such as a desktop, notebook, tablet, or handheld computer), a personal digital assistant (PDA), a wearable computing device, or some other type of device that may be equipped with at least some information processing capabilities.

[00158] As shown in Figure 6, computing device 600 may include a communication interface 602, a user interface 604, a processor 606, data storage 608, microphone 624, and haptic actuator array 626, all of which may be communicatively linked together by a system bus, network, or other connection mechanism 610.

[00159] Communication interface 602 may function to allow computing device 600 to communicate, using analog or digital modulation of electric, magnetic, electromagnetic, optical, or other signals, with other devices, access networks, and/or transport networks. Thus, communication interface 602 may facilitate circuit-switched and/or packet-switched communication, such as plain old telephone service (POTS) communication and/or Internet protocol (IP) or other packetized communication. For instance, communication interface 602 may include a chipset and antenna arranged for wireless communication with a radio access network or an access point. Also, communication interface 602 may take the form of or include a wireline interface, such as an Ethernet, Universal Serial Bus (USB), or High-Definition Multimedia Interface (HDMI) port. Communication interface 602 may also take the form of or include a wireless interface, such as a Wifi, BLUETOOTH®, global positioning system (GPS), or wide-area wireless interface (e.g., WiMAX or 3GPP Long-Term Evolution (LTE)). However, other forms of physical layer interfaces and other types of standard or proprietary communication protocols may be used over communication interface 602. Furthermore, communication interface 602 may comprise multiple physical communication interfaces (e.g., a Wifi interface, a BLUETOOTH® interface, and a wide-area wireless interface).

[00160] In some embodiments, communication interface 602 may function to allow computing device 600 to communicate with other devices, remote servers, access networks, and/or transport networks. For example, the communication interface 602 may function to receive an indication of an audio signal detected by a microphone of another device that is in communication with the computing device 600 via the communication interface 602. Additionally or alternatively, the communication interface 602 may function to transmit an indication of a determined location in a two-dimensional phoneme space, a determined power within a specified band of an audio signal, a determined set of actuator outputs, or some other information that could be used, by another device that is in communication with the computing device 600 via the communication interface 602, to operate an array of haptic actuators.

[00161] User interface 604 may function to allow computing device 600 to interact with a user, for example to receive input from and/or to provide output to the user. Thus, user interface 604 may include input components such as a keypad, keyboard, touch-sensitive or presence-sensitive panel, computer mouse, trackball, joystick, microphone, instrumented glove, force-feedback devices, and so on. User interface 604 may also include one or more output components such as haptic outputs, force-feedback outputs, or a display screen which, for example, may be an augmented reality screen that permits a user to also view the environment of the user through the display screen. The display screen may be based on CRT, LCD, and/or LED technologies, or other technologies now known or later developed. User interface 604 may also be configured to generate audible output(s), via a speaker, speaker jack, audio output port, audio output device, earphones, and/or other similar devices.

[00162] Processor 606 may comprise one or more general purpose processors - e.g., microprocessors - and/or one or more special purpose processors - e.g., digital signal processors (DSPs), graphics processing units (GPUs), floating point units (FPUs), tensor processing units (TPUs), network processors, or application-specific integrated circuits (ASICs). In some instances, special purpose processors may be capable of audio processing and neural network or other machine learning algorithm computation, among other applications or functions. Data storage 608 may include one or more volatile and/or non-volatile storage components, such as magnetic, optical, flash, or organic storage, and may be integrated in whole or in part with processor 606. Data storage 608 may include removable and/or non-removable components.

[00163] Processor 606 may be capable of executing program instructions 618 (e.g., compiled or non-compiled program logic and/or machine code) stored in data storage 608 to carry out the various functions described herein. Therefore, data storage 608 may include a non-transitory computer-readable medium, having stored thereon program instructions that, upon execution by computing device 600, cause computing device 600 to carry out any of the methods, processes, or functions disclosed in this specification and/or the accompanying drawings.

[00164] By way of example, program instructions 618 may include an operating system 622 (e.g., an operating system kernel, device driver(s), and/or other modules) and one or more application programs 620 (e.g., audio processing functions, neural network or other machine learning algorithm functions) installed on computing device 600.

[00165] Application programs 620 may take the form of “apps” that could be downloadable to computing device 600 through one or more online application stores or application markets (via, e.g., the communication interface 602). However, application programs can also be installed on computing device 600 in other ways, such as via a web browser or through a physical interface (e.g., a USB port) of the computing device 600.

III. Example Methods

[00166] Figure 7 is a flowchart of a method 700 for providing a two-dimensional haptic stimulus based on an input audio signal. The method 700 includes obtaining an audio sample (710). For example, the audio sample may be obtained by operating a microphone.
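
To make this step concrete, the following is a minimal sketch of obtaining an audio sample, assuming a 16 kHz mono capture through the third-party sounddevice library; the sample rate and the 25 ms segment length are illustrative choices rather than values prescribed by this disclosure.

```python
# A minimal sketch of step 710 (obtaining an audio sample), assuming the
# `sounddevice` library and a 16 kHz mono microphone. The rate and segment
# length are illustrative assumptions, not values fixed by the disclosure.
import numpy as np
import sounddevice as sd

SAMPLE_RATE_HZ = 16000   # assumed capture rate
SEGMENT_SEC = 0.025      # illustrative 25 ms analysis segment

def obtain_audio_sample(duration_sec: float = 1.0) -> np.ndarray:
    """Record `duration_sec` of mono audio and return it as a 1-D float array."""
    frames = int(duration_sec * SAMPLE_RATE_HZ)
    recording = sd.rec(frames, samplerate=SAMPLE_RATE_HZ, channels=1,
                       dtype="float32")
    sd.wait()  # block until the recording completes
    return recording[:, 0]
```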

[00167] The method 700 additionally includes determining, based on a first segment of the audio sample, a first power in a first band of frequencies, wherein the first band of frequencies comprises frequencies corresponding to a set of predetermined phonemes (720). The predetermined phonemes may be vowel sounds or other speech sounds.
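
One plausible realization of this step is sketched below, assuming an FFT-based front end and the 500 to 3000 Hertz vowel band recited in the claims; the Hann window and the mean-squared-magnitude power estimate are illustrative assumptions.

```python
# A hedged sketch of step 720: the power of one segment within the assumed
# 500-3000 Hz vowel band. Windowing and normalization are illustrative.
import numpy as np

def band_power(segment: np.ndarray, fs: float = 16000.0,
               f_lo: float = 500.0, f_hi: float = 3000.0) -> float:
    """Mean power of `segment` restricted to [f_lo, f_hi] Hertz."""
    windowed = segment * np.hanning(len(segment))
    spectrum = np.fft.rfft(windowed)
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
    in_band = (freqs >= f_lo) & (freqs <= f_hi)
    # Power is estimated as the mean squared magnitude of the in-band bins.
    return float(np.mean(np.abs(spectrum[in_band]) ** 2))
```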

[00168] The method 700 additionally includes determining, based on the first segment of the audio sample, a first audio spectrum, wherein the first audio spectrum comprises a plurality of spectral components in the first band of frequencies (730).
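
Continuing the same assumed FFT front end, the band-limited spectrum could be extracted as follows; the number and spacing of spectral components fall out of the segment length and sample rate and are not dictated by the disclosure.

```python
# A sketch of step 730: the magnitude spectrum restricted to the assumed
# phoneme band, matching the front end used in the band-power sketch above.
import numpy as np

def band_spectrum(segment: np.ndarray, fs: float = 16000.0,
                  f_lo: float = 500.0, f_hi: float = 3000.0) -> np.ndarray:
    """Spectral components of `segment` within [f_lo, f_hi] Hertz."""
    spectrum = np.abs(np.fft.rfft(segment * np.hanning(len(segment))))
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
    return spectrum[(freqs >= f_lo) & (freqs <= f_hi)]
```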

[00169] The method 700 additionally includes applying the first audio spectrum to a trained encoder to generate a first location in a two-dimensional phoneme space (740).
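
The disclosure does not pin down the encoder's architecture at this point, so the sketch below assumes a small feed-forward network with one hidden layer whose two outputs are the phoneme-space coordinates; the parameters w1, b1, w2, and b2 are hypothetical stand-ins for trained weights.

```python
# A sketch of step 740 under the assumption that the trained encoder is a
# one-hidden-layer feed-forward network. All weights here are hypothetical
# stand-ins for parameters learned during training.
import numpy as np

def encode_to_phoneme_space(spectrum: np.ndarray,
                            w1: np.ndarray, b1: np.ndarray,
                            w2: np.ndarray, b2: np.ndarray) -> np.ndarray:
    """Map a band-limited spectrum to an (x, y) point in phoneme space."""
    hidden = np.tanh(spectrum @ w1 + b1)  # hidden layer with tanh nonlinearity
    return np.tanh(hidden @ w2 + b2)      # (x, y) squashed into [-1, 1]^2
```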

[00170] The method 700 additionally includes outputting the first power and first location for operation of a two-dimensional array of haptic actuators based on the determined first power and the generated first location, wherein each haptic actuator of the two-dimensional array of haptic actuators corresponds to a respective location within the two-dimensional phoneme space, wherein operation of the two-dimensional array of haptic actuators based upon the first power and first location comprises actuating, at an amplitude based on the first power, at least one of the haptic actuators that corresponds to a location within the two-dimensional phoneme space that is proximate to the first location (750). The method may further include operating the two-dimensional array of haptic actuators.
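
A hedged sketch of the actuation mapping follows, assuming a 4-by-4 actuator grid, nearest-actuator selection, and linear amplitude scaling; the disclosure also contemplates actuating multiple actuators proximate to the location (e.g., by interpolation), which this sketch omits for brevity.

```python
# A sketch of step 750: drive the actuator whose grid position is nearest
# the encoded location, at an amplitude scaled by the band power. The 4x4
# grid and the linear amplitude scaling are illustrative assumptions.
import numpy as np

GRID_SHAPE = (4, 4)  # assumed rows x columns of the actuator array

def actuator_drive(location_xy: np.ndarray, power: float,
                   max_power: float = 1.0) -> np.ndarray:
    """Return per-actuator amplitudes in [0, 1] for one audio segment."""
    rows, cols = GRID_SHAPE
    # Map phoneme-space coordinates in [-1, 1] onto integer grid indices.
    col = int(round((location_xy[0] + 1.0) / 2.0 * (cols - 1)))
    row = int(round((location_xy[1] + 1.0) / 2.0 * (rows - 1)))
    amplitudes = np.zeros(GRID_SHAPE)
    amplitudes[row, col] = min(power / max_power, 1.0)
    return amplitudes
```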

[00171] The method 700 could include additional elements or features.
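
For example, the sketches above could be chained into a segment-by-segment loop such as the following; the random weights are placeholders for a trained encoder and are for illustration only.

```python
# A hypothetical end-to-end loop tying the sketches above together on
# successive 25 ms segments. The random weights stand in for a trained
# encoder; in practice they would be loaded from a trained model.
import numpy as np

audio = obtain_audio_sample(duration_sec=1.0)
seg_len = int(SEGMENT_SEC * SAMPLE_RATE_HZ)

# Size the stand-in encoder from one segment's band-limited spectrum.
n_components = len(band_spectrum(audio[:seg_len]))
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(n_components, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 2)), np.zeros(2)

for start in range(0, len(audio) - seg_len + 1, seg_len):
    segment = audio[start:start + seg_len]
    power = band_power(segment)
    xy = encode_to_phoneme_space(band_spectrum(segment), w1, b1, w2, b2)
    amplitudes = actuator_drive(xy, power)
    # `amplitudes` would then be handed off to the haptic driver hardware.
```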

IV. Conclusion

[00172] The above detailed description describes various features and functions of the disclosed systems, devices, and methods with reference to the accompanying figures. In the figures, similar symbols typically identify similar components, unless context indicates otherwise. The illustrative embodiments described in the detailed description, figures, and claims are not meant to be limiting. Other embodiments can be utilized, and other changes can be made, without departing from the scope of the subject matter presented herein. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein.

[00173] With respect to any or all of the message flow diagrams, scenarios, and flowcharts in the figures and as discussed herein, each step, block, and/or communication may represent a processing of information and/or a transmission of information in accordance with example embodiments. Alternative embodiments are included within the scope of these example embodiments. In these alternative embodiments, for example, functions described as steps, blocks, transmissions, communications, requests, responses, and/or messages may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved. Further, more or fewer steps, blocks, and/or functions may be used with any of the message flow diagrams, scenarios, and flowcharts discussed herein, and these message flow diagrams, scenarios, and flowcharts may be combined with one another, in part or in whole.

[00174] A step or block that represents a processing of information may correspond to circuitry that can be configured to perform the specific logical functions of a herein-described method or technique. Alternatively or additionally, a step or block that represents a processing of information may correspond to a module, a segment, or a portion of program code (including related data). The program code may include one or more instructions executable by a processor for implementing specific logical functions or actions in the method or technique. The program code and/or related data may be stored on any type of computer-readable medium, such as a storage device, including a disk drive, a hard drive, or other storage media.

[00175] The computer-readable medium may also include non-transitory computer-readable media such as computer-readable media that stores data for short periods of time like register memory, processor cache, and/or random access memory (RAM). The computer-readable media may also include non-transitory computer-readable media that stores program code and/or data for longer periods of time, such as secondary or persistent long-term storage, like read-only memory (ROM), optical or magnetic disks, and/or compact-disc read-only memory (CD-ROM), for example. The computer-readable media may also be any other volatile or non-volatile storage systems. A computer-readable medium may be considered a computer-readable storage medium, for example, or a tangible storage device.

[00176] Moreover, a step or block that represents one or more information transmissions may correspond to information transmissions between software and/or hardware modules in the same physical device. However, other information transmissions may be between software modules and/or hardware modules in different physical devices.

[00177] While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope being indicated by the following claims.