

Title:
VOICE ACTIVITY DETECTOR AND METHODS THEREFOR
Document Type and Number:
WIPO Patent Application WO/2018/152034
Kind Code:
A1
Abstract:
Methods, systems, and apparatuses for a low-complexity acoustic activity detector are disclosed. A method includes forming a sequence of frames by blocking digital data representative of acoustic activity. For each frame, the method includes determining a plurality of power metrics based on a transformation of the frame data from the time domain to the frequency domain using a discrete Fourier transform having constant coefficients dependent on a plurality of select frequencies within a range of voice frequencies. For each frame, the method also includes determining a plurality of signal to noise ratios, each being a ratio of one of the power metrics to a corresponding noise metric. The method includes determining whether the digital data representative of the acoustic activity includes voice activity by determining whether the signal to noise ratios for each of a plurality of frames satisfy a criterion.

Inventors:
PATURI ROHIT (US)
YE ANNE (US)
RUB LEONARDO (US)
LAROCHE JEAN (US)
NEMALA SRIDHAR KRISHNA (US)
Application Number:
PCT/US2018/017700
Publication Date:
August 23, 2018
Filing Date:
February 09, 2018
Assignee:
KNOWLES ELECTRONICS LLC (US)
International Classes:
G10L15/20; G06F17/14; G10L15/22; G10L25/03; G10L25/18; G10L25/21; G10L25/78; G10L25/84; G10L25/87; H04R3/00
Foreign References:
US5963901A1999-10-05
US20150243300A12015-08-27
US6453291B12002-09-17
Attorney, Agent or Firm:
BELDEN, Brett P. et al. (US)
Claims:
WHAT IS CLAIMED IS:

1. A method in a voice activity detector, the method comprising:

forming a sequence of frames having frame data by blocking digital data representative of acoustic activity;

for each frame of the sequence of frames:

determining a plurality of power metrics based on a transformation of the frame data from the time domain to the frequency domain using a discrete Fourier transform having constant coefficients dependent on a plurality of select frequencies within a range of voice frequencies, each power metric determined at a corresponding one of the plurality of select frequencies; and

determining a plurality of signal to noise ratios, each signal to noise ratio being a ratio of one of the plurality of power metrics to a corresponding noise metric at substantially the same frequency; and

determining whether the digital data representative of the acoustic activity includes voice activity by determining whether the plurality of signal to noise ratios for each of a plurality of frames satisfies a criterion.

2. The method of Claim 1, further comprising determining the plurality of power metrics at the corresponding plurality of select frequencies using only the most significant bits of the transformed frame data.

3. The method of Claim 2, further comprising:

digitizing analog data received from an electro-acoustic transducer;

generating the digital data representative of acoustic activity by decimating the digitized data before forming the sequence of frames;

determining the plurality of power metrics based on the transformation of the frame data using the discrete Fourier transform implemented using integrated hardware gates, wherein the plurality of select frequencies is not more than 5 frequencies within a range between 1,500 Hz and 3,000 Hz; and

determining that the digital data representative of acoustic activity includes voice activity when the plurality of signal to noise ratios of at least two consecutive frames satisfies the criterion.

4. The method of Claim 1, further comprising:

receiving analog data representative of acoustic activity detected by an electro-acoustic transducer;

digitizing the analog data; and

generating the digital data representative of the acoustic activity by decimating the digitized data before forming the sequence of frames, wherein the decimating is performed at a lowest sampling frequency that includes the select frequencies within a range between 1,400 Hz and 5,000 Hz.

5. The method of Claim 1, further comprising providing an indication that the digital data representative of the acoustic activity includes voice activity when the plurality of signal to noise ratios of at least two consecutive frames satisfies the criterion.

6. The method of Claim 1, further comprising:

providing a voice activity detection indication when at least two consecutive frames satisfy the criterion; and

changing a status of the voice activity detection indication only after a specified plurality of frames subsequent to the two consecutive frames do not satisfy the criterion.

7. The method of Claim 1, further comprising, for each frame of the sequence of frames, updating a noise spectrum representative of the noise metric based on a probability that the frame data includes noise at the plurality of select frequencies, wherein the noise spectrum is updated based on the probability that the frame data of a particular frame includes noise at the plurality of select frequencies before determining the plurality of signal to noise ratios for the particular frame.

8. The method of Claim 7, further comprising for each frame of the sequence of frames, determining the probability that the frame data includes noise at the plurality of select frequencies using a piecewise linear sigmoid function.

9. The method of Claim 1, further comprising, before determining whether the digital data representative of the acoustic activity includes voice activity, initializing a noise spectrum by determining a plurality of noise metrics, each noise metric determined at a corresponding one of the plurality of select frequencies.

10. The method of Claim 1, further comprising, for each frame of the sequence of frames, determining an aggregated signal to noise ratio based on the plurality of signal to noise ratios, wherein the determining whether the plurality of signal to noise ratios for each of the plurality of frames satisfies a criterion includes determining whether the aggregated signal to noise ratio for each of the plurality of frames satisfies the criterion.

11. An acoustic signal processing circuit comprising:

an input configured to receive a signal representative of acoustic activity;

a voice activity detector comprising fewer than 100,000 integrated hardware gates and configured to form a sequence of frames from digital data representative of the acoustic activity;

for each frame of the sequence of frames, the voice activity detector configured to:

determine a plurality of power metrics based on frame data transformed by a discrete Fourier transform implemented using the integrated hardware gates, the discrete Fourier transform having constant coefficients dependent on a plurality of select frequencies within a range of voice frequencies, each power metric determined at a corresponding one of the plurality of select frequencies; and

determine a plurality of signal to noise ratios, each signal to noise ratio being a ratio of one of the plurality of power metrics to a corresponding noise metric at substantially the same frequency;

wherein the voice activity detector is further configured to determine whether the digital data representative of the acoustic activity includes voice activity by determining whether the plurality of signal to noise ratios for each of a plurality of frames satisfies a criterion.

12. The processing circuit of Claim 11, further comprising memory, wherein the voice activity detector is configured to store only most significant bits of the transformed frame data in the memory and determine the plurality of power metrics at the corresponding plurality of select frequencies based on only the most significant bits of the transformed frame data.

13. The processing circuit of Claim 12, further comprising:

an A/D converter configured to digitize analog data representative of the acoustic activity; and

a decimator configured to generate the digital data representative of the acoustic activity by decimating the digitized analog data before formation of the sequence of frames;

wherein the integrated hardware gates number fewer than 50,000 and the plurality of select frequencies is not more than 5 frequencies within a range between 1,500 Hz and 3,500 Hz.

14. The processing circuit of Claim 11, wherein the voice activity detector is configured to provide a voice activity detection indication when the plurality of signal to noise ratios of each of at least two consecutive frames satisfies the criterion.

15. The processing circuit of Claim 14, wherein the voice activity detector is configured to change a status of the voice activity detection indication only after a specified plurality of frames subsequent to the two consecutive frames do not satisfy the criterion.

16. The processing circuit of Claim 11, wherein, for each frame of the sequence of frames, the voice activity detector is configured to update the noise metric based on a probability that the frame data includes noise at the plurality of select frequencies.

17. The processing circuit of Claim 16, further comprising a piecewise linear sigmoid function, wherein, for each frame of the sequence of frames, the voice activity detector is configured to determine the probability that the frame data includes noise at the plurality of select frequencies using the piecewise linear sigmoid function.

18. The processing circuit of Claim 16, wherein, before determining whether the digital data representative of the acoustic activity includes voice activity, the voice activity detector is configured to initialize a noise spectrum by determining a plurality of noise metrics for a specified plurality of frames in the sequence of frames, each noise metric determined for a corresponding one of the plurality of select frequencies.

19. The processing circuit of Claim 11, wherein the voice activity detector is configured to, for each frame of the sequence of frames, determine an aggregated signal to noise ratio based on the plurality of signal to noise ratios, and wherein determining whether the plurality of signal to noise ratios for each of the plurality of frames satisfies a criterion includes determining whether the aggregated signal to noise ratio for each of the plurality of frames satisfies the criterion.

20. A microphone assembly comprising:

a housing having a surface-mount electrical interface;

an electro-acoustic transducer disposed at least partially in the housing; and

an electrical circuit disposed in the housing and coupled to the electro-acoustic transducer, the electrical circuit including an analog-to-digital converter configured to digitize an electrical signal output by the electro-acoustic transducer, the electrical circuit including an acoustic activity detector having integrated hardware gates, the acoustic activity detector configured to:

form a sequence of frames having frame data based on the digitized electrical signal;

for each frame of the sequence of frames:

determine a plurality of power metrics based on frame data transformed by a discrete Fourier transform implemented using the integrated hardware gates, the discrete Fourier transform having constant coefficients dependent on a plurality of select frequencies within a range of voice frequencies, wherein each power metric is determined at a corresponding one of the plurality of select frequencies; and

determine a plurality of signal to noise ratios, each signal to noise ratio being a ratio of one of the plurality of power metrics to a corresponding noise metric at substantially the same frequency; and

determine whether the electrical signal represents voice activity based on whether the plurality of signal to noise ratios of each of a plurality of frames satisfies a criterion.

21. The microphone assembly of Claim 20, the electrical circuit including memory, wherein the voice activity detector is configured to store only most significant bits of the transformed frame data in the memory and determine the plurality of power metrics at the corresponding plurality of select frequencies based on only the most significant bits of the transformed frame data.

22. The microphone assembly of Claim 21, the electrical circuit including a decimator configured to decimate the digitized electrical signal before formation of the sequence of frames, wherein the plurality of select frequencies is not more than 5 frequencies within a range between 1,500 Hz and 3,000 Hz.

23. The microphone assembly of Claim 20, wherein the electrical circuit is configured to provide a signal at the electrical interface when the plurality of signal to noise ratios of each of at least two consecutive frames satisfies the criterion, and wherein the electrical circuit is configured to change a status of the signal at the electrical interface only after a specified plurality of frames subsequent to the two consecutive frames do not satisfy the criterion.

24. The microphone assembly of Claim 20, wherein, for each frame of the sequence of frames, the voice activity detector is configured to update a noise spectrum comprising the noise metric based on a probability that the frame data includes noise at the plurality of select frequencies.

25. The microphone assembly of Claim 20, wherein the acoustic activity detector is configured to, for each frame of the sequence of frames, determine an aggregated signal to noise ratio based on the plurality of signal to noise ratios, and wherein to determine whether the electrical signal represents voice activity, the acoustic activity detector is configured to determine whether the aggregated signal to noise ratio for each of the plurality of frames satisfies the criterion.

Description:
VOICE ACTIVITY DETECTOR AND METHODS THEREFOR

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of and priority to U.S. Provisional Patent Application No. 62/458,950, filed February 14, 2017, the entire contents of which are incorporated herein by reference.

FIELD OF THE DISCLOSURE

[0002] The present disclosure relates generally to voice activity detection and, more particularly, to microphone components, electrical circuits, and methods for detecting voice activity.

BACKGROUND

[0003] The following description is provided to assist the understanding of the reader. None of the information provided or references cited is admitted to be prior art.

[0004] Voice control has been increasingly adopted as a favored mode of interaction with a variety of electronic devices including wireless communication handsets, tablets, and laptop personal computers (PCs), among other devices. In some applications, voice activity detection is a prelude to voice or speech detection. Voice activity can be characterized as voice versus noise discrimination whereas voice or speech detection refers to the detection of speech or components of speech including, for example, phonemes, keywords, voice commands, and phrases.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] The objects, features and advantages of the present disclosure will become more fully apparent to those having ordinary skill in the art upon careful consideration of the following Detailed Description and appended claims in conjunction with the accompanying drawings.

[0006] FIG. 1 is a block diagram of a voice activity detector embedded in a microphone assembly or in a host device, according to some embodiments.

[0007] FIG. 2 is a schematic functional block and process diagram for voice activity detection, according to some embodiments.

[0008] FIGS. 3A and 3B are spectrograms of amplitude versus time and frequency versus time, respectively, for clean speech, according to some embodiments.

[0009] FIGS. 4A and 4B are spectrograms of amplitude versus time and frequency versus time, respectively, for noisy speech, according to some embodiments.

[0010] In the following detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, similar symbols typically identify similar components, unless context dictates otherwise. The illustrative embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented here. It will be readily understood that the aspects of the present disclosure, as generally described herein and illustrated in the figures, can be arranged, substituted, combined, and designed in a wide variety of different configurations, all of which are explicitly contemplated and made part of this disclosure.

DETAILED DESCRIPTION

[0011] FIG. 1 is a block diagram of a voice activity detector embedded in a microphone assembly or in a host device. A voice interactive system 100 comprises generally an electro-acoustic transducer 105, a voice activity detection circuit 106, and a host device. The voice activity detection circuit 106 may be implemented in various system configurations depending on the particular application. Examples of such applications include, but are not limited to, voice interaction with cellphones, tablets, laptops and notebook computers, desktop computers, gaming devices or stations, handheld or fixed-location remote control devices, and wearable devices like smart watches, among other devices. Other applications include voice interaction with appliances like refrigerators, ovens, washers, dryers, and other durable goods, as well as with industrial machines and ground, water, air, and space vehicles.

[0012] In one system configuration, illustrated in FIG. 1, the voice activity detector 106 is integrated with a microphone assembly 125 including the electro-acoustic transducer 105 and an electrical circuit 110 disposed at least partially in an enclosed housing that includes a surface mount or other interface for integration with a host device. The electrical circuit 110 includes at least the voice activity detector 106. In some embodiments, the electrical circuit 110 also includes other circuits for functions including, but not limited to, amplification, filtering, buffering, and A/D conversion, among other processing of analog and digital signals representative of acoustic information before and after voice activity detection. The electrical circuit 110 may be embodied as one or more integrated circuits, and the transducer 105 may be embodied as a microelectromechanical systems (MEMS) sensor or die disposed within a housing (e.g., mounted on a cover or substrate) of the microphone assembly 125. The MEMS sensor may be implemented as a capacitive (also referred to as a condenser) sensor or as a piezoelectric (also referred to as a crystal) sensor. Other types of transducers may be used alternatively. In FIG. 1, the host device (not shown in its entirety) includes a processor or CPU 120 with which the microphone assembly 125 communicates. As suggested, the microphone assembly 125 may be integrated with (e.g., by embedding in) a host device embodied as a cellphone, laptop or notebook computer, tablet, gaming device or station, handheld or fixed-location remote control device, wearable device like a smart watch, or an appliance like a refrigerator, oven, washer, or dryer, or other durable goods, as well as industrial machines and ground, air, and water vehicles, among other devices.

[0013] In an alternative embodiment, the voice activity detector 106 is implemented by the host CPU 120 or by another electrical circuit (separate from the CPU 120) integrated with the host device, instead of integrating the voice activity detector with the microphone assembly 125. These alternative implementations are illustrated schematically by the broken line 122 in FIG. 1. In these embodiments, all or a portion of the processing circuitry 115 may be implemented either in the host device or in the microphone assembly 125 with the transducer 105 to perform functions described herein. While the voice activity detectors described herein are suitable for applications with space and power constraints, the disclosed voice activity detectors may also be used in other system architectures and applications.

[0014] FIG. 2 is a schematic functional block and process diagram for a voice activity detection process 200 amenable to various implementations. The arrows and layout of the diagram are not meant to be limiting with respect to the order or flow of information. For example, in alternative embodiments, two or more operations may be performed simultaneously.

[0015] In FIG. 2, audio information is provided by a conditioning block 205 to a frame blocking element 210. Some or all of the function of the conditioning block 205 may be performed by the processing circuit 115 shown in FIG. 1. The conditioning block 205 may not be part of the voice activity detector. In one implementation, the audio information provided to the frame blocking element 210 in FIG. 2 is a stream of digital data representative of an audio or acoustic signal generated by an audio transducer such as the MEMS microphone as discussed herein. The digital data may have any suitable form including, for example, pulse-code modulation (PCM), pulse-density modulation (PDM), or some other data format suitable for processing by the voice activity detector. In some embodiments, the audio data is decimated after digitization to reduce its sampling rate before transmission to the frame blocking element 210. Decimation can be performed by the processing circuitry 115 in FIG. 1. Generally, the sampling rate may be more or less than the Nyquist rate, depending on system constraints (e.g., memory, processing, hardware, space, etc.) and application requirements (e.g., latency). In one implementation, for example, the decimation frequency is the lowest frequency required to satisfy specification requirements imposed on the voice activity detector. Some examples are discussed further herein.

[0016] In FIG. 2, the frame blocking element 210 generates a sequence of audio frames based on the audio information received from the conditioning block 205. That is, a digital data stream representative of acoustic information is broken up into multiple frames wherein each frame includes a temporal portion of the data stream. In one implementation, there is no overlap in data among neighboring frames. In another implementation, there is some overlap among data at the boundaries of some or all neighboring frames. Such overlap may facilitate processing of the frame data, but is not required. In one implementation, the overlap in data between consecutive frames helps to ensure that there is no or minimal data lost during the domain transformation (discussed in greater detail below). In implementations that do not have overlap in data between consecutive frames, the overall complexity of the electrical circuit 110 may be reduced compared to implementations that do have such overlap. For example, having overlap in the data may require more memory to store data, thereby requiring additional logic gates.

[0017] The duration of the frame may be on the order of milliseconds, more or less. In one implementation, the audio frame has five (5) milliseconds of frame data sampled at a rate of 8 kHz resulting in 40 audio samples. The number of audio samples depends generally on frame size and the sampling rate. The frame size may be different in other embodiments, and the frame data may be sampled at a different rate, e.g., 16 kHz, 32 kHz, etc.
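The frame-blocking step described above can be sketched in software. The patent targets a hardware implementation, so this Python fragment is purely illustrative; the no-overlap choice and the 40-sample frame size follow the 5 ms at 8 kHz example in the text.

```python
def block_frames(samples, frame_len=40):
    """Split a stream of digital audio samples into consecutive,
    non-overlapping frames (5 ms at 8 kHz -> 40 samples per frame).
    Trailing samples that do not fill a complete frame are dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```

With overlapping frames, the step of the range would simply be smaller than `frame_len`, at the cost of the extra memory noted above.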

[0018] In some embodiments, a plurality of power metrics is determined for each of a plurality of frames based on a transformation of frame data from the time domain to the frequency domain. More specifically, for each frame, the plurality of power metrics (e.g., power intensity) is determined for a corresponding plurality of frequencies (e.g., frequency bins) within a range of frequencies typically associated with voice activity. The result of the transformation at each frequency is a complex number (i.e., a number with a real portion and an imaginary portion). In one embodiment, the power metric is represented by the magnitude of the complex number. Other power metrics may be used alternatively. For example, in some implementations, the power metric used may be an absolute sum of the real and imaginary portions of the complex number (e.g., as a result of a discrete Fourier transform). The transformed frame data includes, for each frequency bin, one or more most significant bits (MSBs) and one or more least significant bits (LSBs). Some or all of this transformed data may be used to determine the power metrics. The transformed data may be stored in memory (e.g., in registers local to the voice activity detector) for or during processing. In one embodiment, only the MSBs are used to determine the power metrics. In this embodiment, the LSBs are truncated. For example, the lower 8 bits of a 32-bit word are truncated. In such an example, 24 bits of each transformed data word are used. In the implementation in which the lower 8 bits are truncated, 480 bits (or 60 bytes) are stored. In alternative embodiments, any suitable number of bits may be truncated, and any suitable size of frame data may be used. Thus, storage and processing of LSBs is not required, thereby reducing hardware and software resource requirements. For example, as discussed in greater detail below, the functional block and process diagram of FIG. 2 may be implemented using 32,500 logic gates, including the memory used to store the transformed frame data.
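The two low-complexity choices described in this paragraph, the absolute-sum power metric and LSB truncation, can be sketched as follows (illustrative Python; the bit widths follow the 32-bit example in the text):

```python
def power_metric(re, im):
    """Low-complexity power metric for one frequency bin: the absolute
    sum of the real and imaginary parts of the transform result, which
    avoids the multiplications and square root of a true magnitude."""
    return abs(re) + abs(im)

def truncate_lsbs(word, n_bits=8):
    """Keep only the most significant bits of a transformed-data word by
    discarding the lower n_bits (e.g., a 32-bit word -> 24 stored bits)."""
    return word >> n_bits
```

In hardware, the right shift costs no gates at all: the lower wires are simply not stored.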

[0019] In FIG. 2, frames of the sequence of frames are transmitted to a domain transformation block 215 for transformation of the frame data in each frame from the time domain into the frequency domain. In one embodiment, the transformation is performed using a discrete Fourier transform (DFT). Other transforms, such as the cochlear transform, may be used alternatively. The DFT can be implemented in hardware relatively efficiently and is described further herein. The DFT parameters are determined based on a combination of coefficients coordinated with frequencies of interest and a Hanning window parameter. In one implementation, the DFT coefficients are constant and dependent on the plurality of select frequencies (e.g., frequency bins). In one implementation, the transformation is performed using integrated hardware gates of the voice activity detector, as discussed further herein.

[0020] Generally, the frequency range may be determined based on empirical data or modeling, or it may be customized for a particular user using a learning algorithm. The lower end of the frequency range may be selected to exclude frequencies associated with interfering noise. Thus, the lower end of the frequency range may be selected based on estimated or measured noise for a particular application (e.g., background noise typical of cellphone use, road noise for in-vehicle use, etc.). In some embodiments, the frequency range or cut-off frequencies may be dynamically adjusted based on ongoing periodic measures of ambient noise. In an illustrative embodiment, the frequency band for the range of frequencies is between approximately 1.0 kilohertz (kHz) and approximately 5.0 kHz. In another embodiment, the frequency range is between approximately 1.4 kHz and approximately 3.0 kHz. In other embodiments, the bandwidth and boundary frequencies may be more or less depending on the requirements of the particular application. For example, the lower end of the frequency range may be increased from 1.4 kHz to 1.5 kHz to exclude problematic noise at lower frequencies. The Nyquist rate criterion would be satisfied for these frequency ranges when sampled at 8 kHz; for greater bandwidths, a higher sampling rate may be required to satisfy the criterion. A determination of the power metrics at several frequencies within a frequency range between approximately 1.5 kHz and approximately 3.0 kHz has been found to be suitable for some applications. In one embodiment, power metrics are determined for not more than five frequency bins in this range. Exemplary frequency bins for this range are 1.5 kHz, 1.8 kHz, 2.2 kHz, 2.6 kHz, and 3.0 kHz.
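Because only a handful of select frequencies are needed, the transform can be evaluated directly at those frequencies with precomputed constant coefficients rather than as a full FFT. The sketch below assumes the 8 kHz sampling rate, 40-sample frames, and the five exemplary bins from the text; folding a Hann window into the constant coefficients is this sketch's assumption about how the window parameter is combined with the coefficients.

```python
import math

FS = 8000                                   # sampling rate, Hz
N = 40                                      # samples per 5 ms frame
BINS_HZ = [1500, 1800, 2200, 2600, 3000]    # select frequencies from the text

def _coeffs(f):
    # Constant coefficients for one select frequency, with a Hann window
    # folded in; in hardware these would be fixed gate-level constants.
    out = []
    for n in range(N):
        w = 0.5 - 0.5 * math.cos(2 * math.pi * n / (N - 1))
        ang = 2 * math.pi * f * n / FS
        out.append((w * math.cos(ang), -w * math.sin(ang)))
    return out

COEFFS = {f: _coeffs(f) for f in BINS_HZ}

def dft_at_select_bins(frame):
    """Evaluate the DFT of one frame only at the select frequencies,
    returning a (real, imaginary) pair per frequency bin."""
    result = {}
    for f, cs in COEFFS.items():
        re = sum(x * c for x, (c, _) in zip(frame, cs))
        im = sum(x * s for x, (_, s) in zip(frame, cs))
        result[f] = (re, im)
    return result
```

For five bins and 40 samples this is only 400 multiply-accumulates per frame, which is what makes a small gate-count implementation plausible.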

[0021] In FIG. 2 at block 225, for each frame, a plurality of signal to noise ratios (SNRs) is determined for the corresponding frequencies. Each SNR is a ratio of a power metric at a particular frequency to a corresponding noise metric at substantially the same frequency. An aggregated SNR is also determined for each frame based on the corresponding plurality of signal to noise ratios determined for the frame. The aggregated signal to noise ratio may be a weighted or non-weighted average of the signal to noise ratios determined for the corresponding frame. Alternatively, the aggregated signal to noise ratio may be a summation of weighted or non-weighted signal to noise ratios determined for the corresponding frame.

[0022] In an alternative embodiment, an aggregated SNR is not determined for each frame. In such an embodiment, instead of comparing an aggregated SNR for each frame to an aggregated SNR for the noise spectrum (discussed in greater detail below), each SNR for each frame is compared to a corresponding SNR for the noise spectrum. For example, in the embodiment in which an SNR for a frame is computed for five frequencies, each SNR is compared to an SNR of the noise spectrum at the respective one of the five frequencies. In alternative embodiments, any other suitable comparison may be used. For example, each SNR for the frames may be compared to one or more corresponding references (e.g., predetermined limits) to determine whether there is voice activity in the data of the corresponding frame.
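A minimal sketch of the per-bin SNR and aggregated-SNR computations described above (illustrative Python; the `eps` divide-by-zero guard and the equal-weight average are this sketch's assumptions, and the text also allows weighted averages and summations):

```python
def snrs(power, noise, eps=1e-9):
    """Per-bin signal-to-noise ratios: the power metric over the noise
    metric at (substantially) the same frequency."""
    return {f: power[f] / (noise[f] + eps) for f in power}

def aggregated_snr(ratios, weights=None):
    """Aggregate the per-bin SNRs into one value per frame; a plain
    non-weighted average is one of the options named in the text."""
    if weights is None:
        return sum(ratios.values()) / len(ratios)
    return sum(weights[f] * r for f, r in ratios.items())
```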

[0023] The relationship between the number of frequency bins and voice activity detection accuracy is not necessarily linear. As suggested, in other embodiments, the power metrics could be determined for more than five frequency bins, the frequency bins may have different locations, and the frequency range may be different. Also, any one or more of the frequency ranges, or the locations of the frequency bins within a range, or the number of frequencies for which power metrics are determined may be changed dynamically to improve accuracy or conserve resources (e.g., power consumption). Alternatively, different weightings could be applied to the calculated power metrics to improve performance. For example, if the algorithm erroneously predicts voice activity some specified number of times in a given time interval, the frequency range, location of the frequency bins, or the number of bins within the range could be changed to improve accuracy. Similarly, the number of frequencies for which power metrics are determined could be decreased if the predictions are consistently accurate. The frequency range and frequency bins may also be changed dynamically based on periodic assessments of the detection accuracy.

[0024] The ability of the voice activity detector to accurately detect the presence of voice activity depends on prior initialization of the noise spectrum. The noise spectrum is initialized by aggregating power metrics at specified frequencies for a specified amount of frame data (e.g., a specified number of frames). Generally, the noise spectrum is determined for the same, or substantially the same, frequency bins at which power metrics are obtained for the SNR computations. In one of the examples above, the noise spectrum is an aggregation of noise data for frequency bins at 1.5 kHz, 1.8 kHz, 2.2 kHz, 2.6 kHz, and 3.0 kHz. The noise spectrum can be based on an average, a weighted average, or a summation of weighted or non-weighted noise data obtained at each of the selected frequencies for each frame. In one implementation, the noise spectrum is determined using an infinite impulse response (IIR) filter, wherein more recent data is weighted more heavily than older data. Other filters may be used alternatively. As suggested above, in some embodiments, the generation of the noise spectrum may be performed using only MSBs of the transformed frame data to reduce resource usage.

[0025] The number of frames required to initialize the noise spectrum depends generally on the amount of frame data required to obtain a reasonably accurate measure of the noise spectrum for a particular application. This may depend on the nature or predictability of the noise and the frame size, among other factors. In one application, sixteen 5 millisecond frames are used for initializing the noise spectrum. In other applications, more or fewer frames having the same or different frame sizes may be used based on considerations described herein. For any particular application, the number of frames used to initialize the noise spectrum may be determined using empirical data or using a model of the noise for the application. This information may be coded into the device at the time of manufacture. In some embodiments, the information can also be programmed after the device has been manufactured. The number of frames required to initialize the noise spectrum may also be based on a learning algorithm that revises the frame threshold count based on past performance of the voice activity detector or based on context (e.g., in car, crowded bistro, etc.).

[0026] In FIG. 2, at block 220, a determination is made as to whether a sufficient number of frames have been processed to initialize the noise spectrum. If fewer than a specified minimum number of frames have been processed, then noise spectrum initialization proceeds at block 230. This iteration continues until the noise spectrum initialization condition at block 220 is satisfied. At block 220, if a threshold number of frames have been processed, indicating that the noise spectrum has been initialized, then the noise spectrum may be updated with power metrics representative of noise (e.g., noise data) from the most recently processed frame at block 235. The noise spectrum may be updated using an IIR or other filter.

[0027] Generally, for a particular frequency, power metrics representative of noise of the most recently processed frame are weighted less than the frequency-specific noise metric of the noise spectrum. For example, a weight of approximately 90% may be applied to the frequency-specific noise metric of the noise spectrum and a weight of approximately 10% may be applied to the noise data from the most recently processed frame. Other weighting apportionments may also be used. For example, the weighting applied to the power metrics from the most recently processed frame may be based on a likelihood that the power metric is representative of noise. In this example, the weighting is lower where the probability of noise is lower; at an extreme, the weighting could be zero. Conversely, the weighting is higher where the probability of noise is higher.

[0028] In one implementation, a piecewise-linear approximation of the sigmoid function is used. For example, the sigmoid function may be derived using a Minimum Mean Square Error (MMSE) estimation and/or a Gaussian approximation for speech and noise distribution. In one implementation, the sigmoid function is predetermined. Alternatively, a function other than a sigmoid function can be used. Using a linear approximation of a non-linear function may use fewer hardware and software resources and may decrease the computational latency, etc.
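A minimal sketch of the piecewise-linear sigmoid approximation described above, assuming a simple three-segment form (clamp below, linear ramp, clamp above). The breakpoints are illustrative assumptions, not values from the description.

```python
# Hypothetical three-segment piecewise-linear approximation of a sigmoid,
# mapping an SNR-like statistic to a probability in [0, 1]. The breakpoints
# lo and hi are assumed values for illustration.
def sigmoid_pl(x, lo=-4.0, hi=4.0):
    """Clamp to 0 below lo and to 1 above hi; interpolate linearly between."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)
```

Replacing the exponential with a clamp and one ramp removes the need for a divider or lookup-heavy exponential unit, which is consistent with the resource savings the description attributes to the linear approximation.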

[0029] In an illustrative embodiment, the following formula is used to update the noise spectrum:

UpdatedMetric = (1 − P) × PreviousMetric + P × NewMetric

where UpdatedMetric is the value stored for the updated noise metric at a particular frequency, P is the probability that the current frame under consideration does not contain voice, PreviousMetric is the value stored for the previously determined noise metric at the particular frequency, and NewMetric is the power metric determined for the current frame at the particular frequency. Accordingly, the new noise spectrum contains a noise metric at each of the chosen frequencies.
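The update formula above can be expressed directly in code; the function name is hypothetical.

```python
def update_metric(prev_metric, new_metric, p_noise):
    """Apply UpdatedMetric = (1 - P) * PreviousMetric + P * NewMetric,
    where p_noise is P, the probability that the current frame does not
    contain voice."""
    return (1.0 - p_noise) * prev_metric + p_noise * new_metric
```

When a frame is almost certainly voice (P near 0), the stored noise metric is left essentially unchanged; when the frame is almost certainly noise (P near 1), the new power metric dominates the update.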

[0032] In FIG. 2, at block 235, the noise spectrum is updated with frequency-specific noise data from the current frame. These frequencies correspond to each of the plurality of frequencies for which the power metrics are determined. At block 225, for each frequency, the updated noise spectrum is used to determine a ratio of the power metric of the current frame to the corresponding noise metric of the updated noise spectrum at each frequency bin. The SNRs for each frame are then aggregated as discussed above. Generally, the noise spectrum is updated with noise data, if any, from the current frame before a determination is made about the presence of voice activity in the current frame. In other embodiments, however, the noise spectrum may be updated afterwards, or not at all, depending on the likelihood that the current frame includes noise.
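The per-frame SNR computation and aggregation can be sketched as follows, assuming the aggregate is a plain average across bins (the description also permits weighted schemes); the function name and the epsilon guard are assumptions.

```python
# Illustrative per-frame SNR aggregation: ratio of each bin's power metric
# to the corresponding noise metric, averaged across bins. The eps guard
# against division by zero is an added assumption.
def aggregated_snr(frame_power, noise_spectrum, eps=1e-12):
    """Return the mean of per-bin power-to-noise ratios for one frame."""
    snrs = [p / max(n, eps) for p, n in zip(frame_power, noise_spectrum)]
    return sum(snrs) / len(snrs)
```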

[0030] The voice activity detector determines whether voice activity is present in data representative of an acoustic signal by determining whether the aggregated SNR for each of a plurality of frames satisfies a criterion. In FIG. 2, at block 225, the aggregated SNR for each frame is compared to a reference to determine the likely presence of voice. At block 240, when the aggregated SNR of a specified number of consecutive frames exceeds the reference, voice activity is likely and a voice activity detection signal is generated at block 245. In one embodiment, voice activity is likely present when the aggregated signal to noise ratio of at least two consecutive frames satisfies the criterion. In other embodiments, voice activity is likely present when the aggregated signal to noise ratio of at least five consecutive frames satisfies the criterion. The consecutive frame count criterion depends in part on the duration of the frame data in each frame, among other factors. There may be a trade-off between voice detection accuracy and latency. If the voice activity detection accuracy is unsatisfactory, the number of consecutive frames considered may be increased, the reference may be increased, or both. Conversely, if voice activity detection accuracy is satisfactory, the frame count criterion may be reduced in an effort to reduce latency. After successfully detecting voice activity, the voice activity detector is configured to assume the continued presence of voice activity for a specified number of subsequent frames even if the subsequent frames do not satisfy the criterion. In one implementation, block 240 uses seven bits to determine whether the number of consecutive frames that include voice activity is above a specified number (e.g., the seven bits can be used to count up to the specified number). In alternative implementations, any other suitable number of bits may be used.
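The consecutive-frame criterion and the hold-over behavior described above can be sketched as a small per-frame state machine. The reference level, consecutive-frame count, and hangover length below are illustrative assumptions, as are the class and method names.

```python
# Sketch of the frame-level decision logic: declare voice when the aggregated
# SNR exceeds a reference for n_consecutive frames, then hold the decision
# for a hangover period. All numeric defaults are assumed values.
class VoiceDecision:
    def __init__(self, reference=6.0, n_consecutive=5, hangover=10):
        self.reference = reference
        self.n_consecutive = n_consecutive
        self.hangover = hangover
        self.count = 0  # consecutive frames above the reference (the 7-bit counter)
        self.hold = 0   # remaining frames during which voice is assumed

    def step(self, aggregated_snr):
        """Process one frame's aggregated SNR; return True if voice is declared."""
        if aggregated_snr > self.reference:
            self.count += 1
        else:
            self.count = 0
        if self.count >= self.n_consecutive:
            self.hold = self.hangover  # assume voice continues for a while
        if self.hold > 0:
            self.hold -= 1
            return True
        return False
```

The counter only needs to reach the consecutive-frame threshold, which is why a seven-bit register suffices in the implementation described above.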

[0031] In FIG. 2, the functions of blocks 210, 215, 220, 225, 230, 235, and 240 may be implemented using logic gates, for example, complementary metal-oxide semiconductor (CMOS) gates, in an integrated circuit. As used herein, the term "gate" refers to a standard-cell NAND2 (e.g., negative-AND logic gate) that includes two p-channel metal-oxide semiconductor (pMOS) transistors and two n-channel metal-oxide semiconductor (nMOS) transistors. In other embodiments, the functions of the blocks in FIG. 2 can be implemented using bipolar junction transistors (BJTs), such as NPN or PNP transistors, or field effect transistors (FETs), such as junction FETs (JFETs) or metal-oxide semiconductor FETs (MOSFETs).

[0032] In embodiments where power metrics are determined for not more than five frequency bins, the processing of the voice activity detector may be performed with as few as 25,000 integrated logic gates in the voice activity detector. In one implementation, 32,500 gates are used. As suggested above, processing and storage resources are reduced where only the most significant bits (MSBs) are used. Additional logic gates provide flexibility to determine power metrics for additional frequency bins and to store more data (e.g., more MSBs or LSBs) if necessary. Thus, in one embodiment, the voice activity detector includes 50,000 integrated hardware gates. In another embodiment, the voice activity detector includes 100,000 gates. Such embodiments use less energy and space compared to conventional implementations, with current drain on the order of a microamp (µA) at a clock speed of 384 kHz. Conventional voice activity detectors may use more than 500,000 gates and draw current on the order of a milliamp (mA) at similar clock speeds.

[0033] FIGS. 3A and 3B illustrate speech in the presence of little background noise. FIG. 3A is a graph of an audio signal with amplitude on the y-axis and time on the x-axis. FIG. 3B is a spectrogram of the audio signal of FIG. 3A with frequency on the y-axis and time on the x-axis, wherein the shades of gray indicate power or intensity. The speech bursts 305, 310, 315, and 320 of FIG. 3A are readily detectable in FIG. 3B.

[0034] FIGS. 4A and 4B illustrate the speech of FIGS. 3A and 3B in the presence of background noise. FIG. 4A is a graph of an audio signal with amplitude on the y-axis and time on the x-axis. FIG. 4B is a spectrogram of the audio signal of FIG. 4A with frequency on the y-axis and time on the x-axis, wherein the shades of gray indicate power or intensity. The speech bursts 405, 410, 415, and 420 of FIG. 4A are detectable in FIG. 4B using the methods described herein despite the noise.

[0035] The graphs of FIGS. 3B and 4B can be used to illustrate how process 200 can be used to determine the presence of speech. An average of the intensity of the selected frequencies (e.g., at 1,500 Hz, 1,875 Hz, 2,250 Hz, 2,625 Hz, and 3,000 Hz) is taken over two seconds. As shown in FIG. 3B, the frequencies of clean speech are concentrated in the 1,500 Hz to 3,000 Hz range and are easily detected. As shown in FIG. 4B, when the signal is noisy, the speech can still be detected by looking at the 1,500 Hz to 3,000 Hz range. Frequencies below the 1,500 Hz range, corresponding to low-frequency noise such as background noise, car noise, etc., can be ignored. The intensity of frequencies above 3,000 Hz may be above the thresholds of block 240 but can be ignored because such ranges carry a minority of human speech energy. However, intensities within the 1,500 Hz to 3,000 Hz range may indicate speech, even in noisy conditions.
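The abstract describes computing power metrics via a discrete Fourier transform with constant coefficients at a few select frequencies. One well-known way to evaluate a single DFT bin with a constant per-bin coefficient is the Goertzel algorithm; the description does not name Goertzel, so the sketch below is one plausible low-complexity realization rather than the disclosed implementation.

```python
import math

# Hypothetical single-bin power metric using the Goertzel recurrence, a
# standard method for evaluating one DFT bin with a constant coefficient.
def bin_power(samples, freq_hz, sample_rate_hz):
    """Return |X(k)|^2 for the DFT bin nearest freq_hz."""
    n = len(samples)
    k = round(n * freq_hz / sample_rate_hz)  # nearest integer bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)  # the constant coefficient
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # Standard Goertzel power expression from the final two states.
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2
```

Only one multiply-accumulate per sample per bin is needed, which is consistent with evaluating just five bins in the 1,500 Hz to 3,000 Hz range at very low gate counts.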

[0036] In some instances, voice interaction with devices such as smart speakers, smartphones, tablets, computers, etc. is becoming prevalent in many people's lives. These devices need more and more resources as voice activation becomes more ubiquitous and comprehensive. However, such resources (e.g., computing power, space within the devices, battery power, etc.) are generally becoming scarcer as these devices become more complex, more capable, and smaller in size.

[0037] One application where voice activity detection is used is in always-on voice activated devices like handheld wireless communication handsets. In these applications, voice activity detection enables the device to operate in a power conservation mode until voice activity is detected. Upon the detection of voice activity, the device transitions to a higher power consumption mode during which speech or voice detection occurs. In some instances, microphones are included in devices that facilitate voice activation of the devices. For example, a microphone can constantly monitor audio signals for specific keywords such as "OK, Joogle," "Hey, Shiri," or "Alex." The microphone can include circuitry that monitors for a human voice. Once a human voice is detected, the microphone can transmit a wake-up signal to a central processing unit (CPU). The CPU can then analyze the audio signal from the microphone to determine whether the specific keyword was spoken. In an alternative embodiment, the microphone can monitor for the human voice, determine whether the specific keyword was spoken, and then transmit the wake-up signal to the CPU. Thus, the CPU can normally be in a low-power "sleep" mode while the microphone monitors for a human voice. Such a configuration reduces the power consumption of the device compared to the CPU being "awake" and monitoring for the keywords.

[0038] While using microphones to monitor for a human voice (or specific keywords) generally reduces power consumption of the device, further reduction of microphone power consumption is also beneficial. In various embodiments described herein, a simplified microphone is provided that consumes less power than typical microphones while maintaining a high level of accuracy in detecting auditory signals such as a human voice. Furthermore, in some embodiments, the simplified microphone is smaller than typical microphones.

[0039] In some instances, a simplified voice activity detector can be used to detect whether a voice is present. For example, some smartphones include a feature that allows the user to speak a keyword at any time to activate a function of the smartphone. In one example, a user speaks the keyword and then a command, such as "OK, Joogle" followed by "Find directions to the nearest farrier." In this example, the simplified voice activity detector constantly monitors a signal from an acoustic transducer for an indication of a human voice (e.g., the beginning of "OK, Joogle"). When the simplified voice activity detector detects the human voice, the voice activity detector transmits a signal to preprocessing circuitry. The preprocessing circuitry can analyze the audio signal for the specific keyword. When the preprocessing circuitry determines that the keyword was spoken, the audio signal is transmitted to a main processor that detects the command and responds to the command.

[0040] Using a stepped architecture (e.g., as in the example above) reduces the total amount of energy consumed by the smart device. That is, the simplified voice activity detector consumes the least power but is on the most (e.g., constantly listening for a voice), the preprocessing circuitry consumes a moderate amount of power and is on a moderate amount of time (e.g., any time that the voice activity detector detects a voice), and the main processor consumes the most power but is on the least amount of time (e.g., only when the preprocessing circuitry detects the keyword or the processor is otherwise used by a user).

[0041] In some illustrative embodiments, a MEMS element (e.g., the electro-acoustic transducer 105) can be any suitable transducer that converts acoustic energy into an electrical signal. For example, the MEMS element can include a variable capacitor that changes capacitance based on a change in air pressure. In the embodiment shown in FIG. 1, the electrical signal is indicative of the acoustic energy and is transmitted to the electrical circuit 110, which includes a voice activity detector 106. The voice activity detector 106 monitors the electrical signal for an indication that the acoustic energy includes a human voice. When the voice activity detector 106 detects a human voice (or any other suitable acoustic indication), a "wake up" signal is transmitted to the preprocessing circuitry 115. In an illustrative embodiment, the preprocessing circuitry 115 monitors the electrical signal for a specific indication. For example, the preprocessing circuitry 115 can monitor for an indication of a specific keyword or other auditory signal. Using an example above, the preprocessing circuitry 115 can monitor for the sound of "OK, Joogle." In other embodiments, the preprocessing circuitry 115 can monitor for any other suitable command or portion of a command. For example, the preprocessing circuitry 115 can monitor for a phoneme, a sonant, a syllable, a diphthong, a phonetic unit, a command, a phrase, a sentence, etc.

[0042] In an illustrative embodiment, when the preprocessing circuitry 115 detects that the specific indication is present, a signal is transmitted to the CPU 120. The signal received by the CPU 120 can be an indication that the CPU 120 should "wake up" and/or monitor a signal from the microphone with the MEMS element. Thus, the CPU 120 does not receive, or ignores, electrical signals from the microphone until the preprocessing circuitry 115 detects the specific indication. In an alternative embodiment, the voice activity detector 106 and the preprocessing circuitry 115 are implemented together.

[0043] In some illustrative embodiments, a microphone is a MEMS microphone. In such embodiments, the microphone can include a capacitive transducer and an application-specific integrated circuit (ASIC) electrically connected to the transducer. The ASIC can be used to, at least partially, process the electrical signal from the transducer. For example, the ASIC can include the voice activity detector of FIG. 2. In an illustrative embodiment, the MEMS microphone includes a base and a cover that define an internal volume (e.g., a back volume), and the ASIC is within the internal volume (e.g., attached to the base).

[0044] The foregoing description of illustrative embodiments has been presented for purposes of illustration and of description. It is not intended to be exhaustive or limiting with respect to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of the disclosed embodiments. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents.




 