Title:
DEVICE FOR CAPTURING AND OUTPUTTING AUDIO
Document Type and Number:
WIPO Patent Application WO/2018/158558
Kind Code:
A1
Abstract:
This specification describes a device comprising: an acoustically transparent housing; a microphone array comprising a plurality of microphones for capturing audio data, the plurality of microphones being located at a lower end of the housing; at least one speaker configured to output an output sound dependent on an output audio signal, the speaker being located above the microphone array; and a data processing apparatus configured to provide: a plurality of acoustic echo cancellers, each for use in generating a respective echo cancelled audio signal dependent on audio data derived from a respective microphone, using information derived from the said output audio signal; and an adaptive beamformer configured to combine the echo cancelled audio signals.

Inventors:
SANTILLI ANDREA (GB)
Application Number:
PCT/GB2018/050418
Publication Date:
September 07, 2018
Filing Date:
February 16, 2018
Assignee:
ASDSP LTD (GB)
International Classes:
H04R3/00; H04M9/08; H04R1/26
Foreign References:
US20160007114A1 2016-01-07
US20060239443A1 2006-10-26
US20040125942A1 2004-07-01
GB2495130A 2013-04-03
Attorney, Agent or Firm:
DAINTY, Katherine et al. (GB)
Claims:
Claims

1. A device comprising:

an acoustically transparent housing;

a microphone array comprising a plurality of microphones for capturing audio data, the plurality of microphones being located at a lower end of the housing;

at least one speaker configured to output an output sound dependent on an output audio signal, the speaker being located above the microphone array; and

a data processing apparatus configured to provide:

a plurality of acoustic echo cancellers, each for use in generating a respective echo cancelled audio signal dependent on audio data derived from a respective microphone, using information derived from the said output audio signal; and

an adaptive beamformer configured to combine the echo cancelled audio signals.

2. A device according to claim 1, wherein the plurality of microphones are provided in a circular array.

3. A device according to claim 2, wherein the microphones are provided at equally spaced intervals at the lower end of the housing.

4. A device according to any preceding claim, wherein the microphones are omnidirectional microphones.

5. A device according to any preceding claim, wherein the microphone array comprises at least two microphones and no more than eight microphones.

6. A device according to any preceding claim, wherein the speaker is concentric with the microphone array.

7. A device according to any preceding claim, wherein the microphones are located no more than 20 mm from the bottom surface of the housing.

8. A device according to any preceding claim, wherein the housing is cylindrical.

9. A device according to any preceding claim, wherein the housing is a mesh.

10. A device according to any preceding claim, wherein the housing comprises a perforated metal.

11. A device according to any preceding claim, wherein the speaker comprises a high frequency speaker driver and a low frequency speaker driver.

12. A device according to any preceding claim, wherein the speaker comprises a plurality of speaker drivers, each speaker driver being arranged to direct sound in a different direction to the other speaker drivers.

13. A device according to claim 12, wherein the plurality of speaker drivers are arranged to render the output audio signal as spatial audio.

14. A device according to any preceding claim, wherein the adaptive beamformer is a minimum variance distortionless response beamformer.

Description:
Device for Capturing and Outputting Audio

Field

This specification generally relates to a device for capturing and outputting audio.

Background

Audio capture and output devices such as conference speakerphones include a microphone and a speaker for capturing a user's voice and outputting the voice of a recipient.

Such devices may employ signal processing algorithms to reduce the occurrence of echo or feedback.

Summary

The specification describes a device comprising: an acoustically transparent housing; a microphone array comprising a plurality of microphones for capturing audio data, the plurality of microphones being located at a lower end of the housing; at least one speaker configured to output an output sound dependent on an output audio signal, the speaker being located above the microphone array; and a data processing apparatus configured to provide: a plurality of acoustic echo cancellers, each for use in generating a respective echo cancelled audio signal dependent on audio data derived from a respective microphone, using information derived from the said output audio signal; and an adaptive beamformer configured to combine the echo cancelled audio signals.

The plurality of microphones may be provided in a circular array.

The microphones may be provided at equally spaced intervals at the lower end of the housing.

The microphones may be omnidirectional microphones.

The microphone array may comprise at least two microphones and no more than eight microphones.

The speaker may be concentric with the microphone array.

The microphones may be located no more than 20 mm from the bottom surface of the housing.

The housing may be cylindrical.

The housing may be a mesh.

The housing may comprise a perforated metal.

The speaker may comprise a high frequency speaker driver and a low frequency speaker driver.

The speaker may comprise a plurality of speaker drivers, each speaker driver being arranged to direct sound in a different direction to the other speaker drivers.

The plurality of speaker drivers may be arranged to render the output audio signal as spatial audio.

The adaptive beamformer may be a minimum variance distortionless response beamformer.

Brief Description of the Figures

For a more complete understanding of the device described herein, reference is now made to the following description taken in connection with the accompanying drawings in which:

Figure 1 is a schematic illustration of a device for capturing and outputting audio according to an embodiment of the specification;

Figure 2 is a schematic diagram illustrating the relationship between components of the device according to an embodiment of the specification;

Figure 3 is a schematic illustration of the device according to an embodiment of the specification;

Figure 4 is a schematic illustration of the device according to an embodiment of the specification;

Figure 5 is a schematic illustration of the device according to an embodiment of the specification.

Detailed Description

In the description and drawings, like reference numerals may refer to like elements throughout. The device described herein is a device for capturing audio and outputting sound. The device may be, for example, a speakerphone device used for communication over a telephone network, such as in a teleconferencing system. For example, the device may be a speakerphone for use in conference calling, wherein the microphones capture sound including a user's voice for transmission to a recipient at the other end of the telephone connection. The speaker may output, to the user, sound corresponding to an audio signal transmitted to the device from the recipient.

However, it will be understood that the device is not limited to purposes of teleconferencing and transmitted communications. The device may be configured to output audio transmitted to the speaker from a server. The output audio may, for example, comprise computer synthesised speech, an audio track, such as music, or a pre-recorded voice file. The audio captured by the microphones in such a device may be transmitted to the server. The device or the server may be configured to recognise certain sounds or words captured by the microphones as instructions, and the device may provide an output in response to the instruction.

Figure 1 is a schematic illustration of a device 1 for capturing and outputting audio according to an embodiment of this specification. The device 1 comprises an acoustically transparent housing 10. The housing 10 may comprise any material which allows sound waves to pass through. For example, the housing 10 may comprise a mesh. The mesh may be a material including a plurality of holes through which sound waves are able to pass. For example, the mesh may be formed of perforated metal. However, it will be understood that any suitable acoustically transparent material may be utilised.

A microphone array 20 is provided at a lower end of the housing 10. The microphone array 20 comprises a plurality of microphones 21. The microphones 21 are configured to capture audio data. For example, the captured audio data may be a voice of one or more users speaking in the vicinity of the microphone array 20. The captured audio data may also include background noise other than the voice of a user. Due to the housing 10 being acoustically transparent, each microphone 21 may capture audio data from audio sources provided in a 360 degree range around the microphone array 20.

The microphones 21 may be provided as a circular array 20. The spacing between the microphones 21 may be uniform. In this way, the microphone array 20 may be configured to uniformly capture audio data in a 360 degree range. Optionally, the microphone array 20 may additionally include a central microphone element.
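
The uniform circular layout described above can be sketched in a few lines of Python. This is only an illustration: the function names, the four-microphone count and the 40 mm radius are assumptions made for the example and are not values given in the specification.

```python
import math

def circular_array_positions(num_mics, radius_m):
    """Return (x, y) coordinates, in metres, of microphones equally spaced on a circle."""
    if num_mics < 2:
        raise ValueError("a microphone array needs at least two microphones")
    step = 2.0 * math.pi / num_mics  # uniform angular spacing between neighbours
    return [(radius_m * math.cos(k * step), radius_m * math.sin(k * step))
            for k in range(num_mics)]

def adjacent_spacing(num_mics, radius_m):
    """Chord length between neighbouring microphones on the circle."""
    return 2.0 * radius_m * math.sin(math.pi / num_mics)

# Hypothetical example: four microphones on a circle of 40 mm radius.
positions = circular_array_positions(4, 0.04)
spacing = adjacent_spacing(4, 0.04)
```

With these assumed values the neighbouring microphones are a few centimetres apart, and every microphone sits at the same distance from the array centre.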

The microphones 21 may be omnidirectional microphones. This allows various beamforming algorithms to be implemented in order to improve reduction of background noise and reverberation in the captured audio data. If arrays of omnidirectional microphones are used, the separation between the microphones 21 may be no more than a few centimetres. In this way, the directional sensitivity of the microphone array 20 may be improved while maintaining a substantially uniform 360 degree pickup range of audio sources.

In some examples, the microphone array 20 may comprise no more than eight microphones 21, and no fewer than two microphones 21. In this way, production costs may be reduced while providing a uniform 360 degree pickup.

In some examples, the microphone array 20 may be provided inside the housing 10 on the bottom surface. By including the microphones 21 on the bottom surface, interference from sound reflected off the surface on which the device 1 is seated may be reduced. This may occur because the distance that the reflected sound waves travel from the surface from which they are reflected to the microphones 21 is small, which reduces the phase difference between the sound received at the microphones 21 directly from the audio source and the reflected sound waves, thereby reducing destructive interference. The destructive interference which may occur is commonly known as the "comb filtering effect". Indeed, the signal may be increased by constructive interference from the reflected sound waves, thereby improving the sensitivity of the microphones 21. This effect is commonly known as "pressure reinforcement".

To reduce the deterioration from destructive interference of reflected waves, the microphones 21 may be provided to be no more than 20 mm from the bottom surface of the housing. Additionally, to reduce any comb filtering effect, the speaker enclosure may be provided at least 100 mm above the microphone array to reduce its impact on the sound field in proximity of the microphone array. By placing the speaker enclosure at least 100 mm away from the microphone array, there is flexibility in the options for designing the microphone array. For example, if desired, the microphone array could be configured to use a large number of microphones, and may include a central microphone element.
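
The benefit of a small microphone height can be illustrated with a rough calculation of where the first comb filter notch would fall. The sketch below is not taken from the specification: it assumes a far-field source, a rigid reflecting surface with no phase inversion, and a speed of sound of 343 m/s, and the function name is hypothetical. Under those assumptions the reflected path is longer than the direct path by 2·h·sin(elevation), and the first notch occurs where that extra path equals half a wavelength.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed: air at roughly 20 degrees Celsius

def first_comb_notch_hz(mic_height_m, elevation_deg):
    """First notch frequency for a far-field source above a rigid reflecting surface."""
    # Extra distance travelled by the reflected wave relative to the direct wave.
    path_difference = 2.0 * mic_height_m * math.sin(math.radians(elevation_deg))
    if path_difference <= 0.0:
        return math.inf  # grazing incidence: direct and reflected paths coincide
    # Destructive interference first occurs when the extra path is half a wavelength.
    return SPEED_OF_SOUND_M_S / (2.0 * path_difference)

# Worst case (source directly overhead) for two hypothetical microphone heights.
notch_at_20mm = first_comb_notch_hz(0.020, 90.0)
notch_at_100mm = first_comb_notch_hz(0.100, 90.0)
```

Under these assumptions, lowering the microphone pushes the first notch upwards in frequency, which is consistent with the reasoning above for keeping the microphones close to the bottom surface.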

In the example of Figure 1, a speaker 30 is provided at a plane located above the microphone array 20. The speaker 30 may be configured to output an output sound dependent on an output audio signal. For example, the speaker 30 may be configured to receive the output audio signal, and to output an output sound corresponding to the output audio signal. The speaker 30 may be arranged to direct sound in a direction away from the microphones 21. In the example of Figure 1, the speaker 30 is arranged to direct the output sound in an upward direction. However, as described in other examples herein, the speaker 30 may be arranged to direct sound in directions other than an upward direction. The speaker 30 is generally arranged so that the sound generated by the speaker 30 is not directed in the direction of the microphones 21. In this way, the device 1 may reduce the amount of audio captured by the microphones 21 which corresponds to sound output by the speaker 30. Sound output by the speaker 30, if captured by the microphones 21, may result in an echo in the captured audio data.

The device 1 includes a data processing apparatus 40 which is configured to provide a plurality of acoustic echo cancellers 50 and a beamformer 60, described in more detail with respect to Figure 2. The acoustic echo cancellers 50 and the beamformer 60 may be implemented in the data processing apparatus 40 by way of a computer program. The computer program, when executed by the data processing apparatus 40, causes the data processing apparatus 40 to perform acoustic echo cancellation using audio data derived from each of the microphones 21 individually. Additionally, the computer program, when executed by the data processing apparatus, causes the data processing apparatus to perform a beamforming process on the echo cancelled signals 53 which are generated using the acoustic echo cancellers 50.

The echo cancelling process comprises removing components corresponding to the output audio signal from the audio data derived from each individual microphone 21. An acoustic echo canceller 51 may be provided to correspond respectively to each of the microphones 21. Each acoustic echo canceller 51 is configured to receive the output audio signal and to generate an echo replica, which is subtracted from the respective microphone signal at a subtraction node 52. In this way, the echo cancelling process causes an echo cancelled audio signal 53 to be generated for each microphone.

In embodiments the acoustic echo cancellers 50 are adaptive echo cancellers, and may each comprise an adaptive filter. More specifically, in producing an echo cancelled audio signal 53, the output of the subtraction node 52 may be used to adjust the coefficients of the adaptive filter.
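
The adaptive echo cancelling described above can be sketched for a single microphone as follows. This is a minimal illustration using a time-domain normalised least mean squares (NLMS) filter, which is an assumption chosen for brevity; the specification does not prescribe NLMS. The function name, tap count, step size and simulated echo path are all hypothetical.

```python
import random

def nlms_echo_cancel(far_end, mic, num_taps=8, step=0.5, eps=1e-8):
    """Return the echo cancelled signal for one microphone using an NLMS adaptive filter."""
    weights = [0.0] * num_taps   # adaptive filter coefficients
    history = [0.0] * num_taps   # most recent output audio samples, newest first
    cancelled = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_replica = sum(w * h for w, h in zip(weights, history))
        error = d - echo_replica  # the subtraction node output
        norm = sum(h * h for h in history) + eps
        # The subtraction node output adjusts the filter coefficients.
        weights = [w + step * error * h / norm for w, h in zip(weights, history)]
        cancelled.append(error)
    return cancelled

# Simulated example: the microphone hears only an echo of the output audio signal.
rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
echo_path = [0.6, 0.3, -0.1]  # hypothetical speaker-to-microphone impulse response
mic = [sum(echo_path[k] * far_end[n - k] for k in range(len(echo_path)) if n - k >= 0)
       for n in range(len(far_end))]
residual = nlms_echo_cancel(far_end, mic)
```

In a device with several microphones, one such canceller would run per microphone, each with its own coefficients, since the echo path from the speaker differs for each microphone.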

As shown in Figure 2, the beamformer 60 is configured to receive each echo cancelled audio signal 53. The beamformer 60 may be an adaptive beamformer 60 configured to perform adaptive spatial signal processing on the echo cancelled signals 53. The beamforming process may adapt to increase a signal strength from a given direction. For example, the direction of a user's voice may be determined using speech detection and direction of arrival algorithms, and the signals from this direction may be increased by the beamformer 60. Additionally, spectral statistics of noise may be determined, and the beamformer 60 may be configured to reduce the noise accordingly. Therefore, the echo cancelled signals 53 corresponding to captured audio data from each microphone 21 are combined by the beamformer 60 and filtered spatially to improve the signal quality based on a desired signal source. The beamformer 60 outputs the spatially filtered signal.

Performing acoustic echo cancellation using audio data derived from each microphone 21 individually before providing the echo cancelled signals 53 to the beamformer 60 allows the speaker 30 to be placed above the microphone array 20 without requiring the coupling factors and phase of any sound received from the speaker 30 to be equal at each microphone 21. This is in contrast to arrangements in which the beamformer is fed by the original microphone signals and the signal from the beamformer is provided to a single input acoustic echo canceller. Therefore, the arrangement of the speaker 30 in the device 1 above the microphone array 20 may be flexible and not subject to constraints for minimising the phase difference for sound received at the microphones 21 in the array 20. In other words, the device 1 does not require a null directional response to be achieved along the vertical axis (towards the speaker) by mutually cancelling the echo among the various microphones.

The device 1 may be configured to implement a variety of adaptive beamforming processes. For example, the type of adaptive beamforming may be selected to reduce background noise (which may include stationary and diffuse or non-stationary and localised background noise). Adaptive beamforming algorithms are able to adjust a directional response so that its nulls are directed towards dominant noise sources. The beamformer 60 of the embodiments described herein may be implemented, for example, by using a Minimum Variance Distortionless Response (MVDR) algorithm. However, it will be understood that any suitable beamforming algorithm may be selected, such as, for example, any algorithm falling inside the Generalised Sidelobe Canceller framework. MVDR may provide for a flat frequency response in a given microphone pickup direction while reducing the array gain for a given interference signal, if the spectral properties of the interference can be estimated. The spectral properties may include, but are not limited to, the power spectrum and cross spectrum of microphone pairs. It will be understood that any suitable spectral property may be used. If the microphone array includes a central microphone element, a wider range of beamforming algorithms may be considered.

The adaptive beamforming algorithms may be configured to reduce localised stationary noise sources (such as, for example, noise from a PC fan). These algorithms may also be extended to reduce non-stationary noise sources if the noise sources can be distinguished from the desired audio sources such as the speech of a user. For example, as described above, speech discrimination algorithms may be used to distinguish between a user's speech and background noise. The algorithms may determine that noise coming from a given region should be treated as interference.

The beamformer 60 may be configured to reduce any residual echo left over after the captured audio data has passed through the acoustic echo cancellers 50. Such residual echo may include distortion introduced by the speaker 30 to the sound output from the speaker 30. Such distortion does not have a corresponding signal component in the output audio signal, and so the distortion will not be removed when the acoustic echo cancellers 50 remove the output audio signal components from the captured audio data. The acoustic echo cancellers 50 may provide estimates of the residual echo spectrum and detection of the various talking states (single talk, double talk, for example), and so the residual echo spectral statistics can be fed to an MVDR beamformer. The beamformer 60 may be configured to focus on desired speech sources while reducing noise and residual echo by dynamically shaping its directional response according to various states. Such states may include talking states such as single talk, double talk, near speech only, and noise only. The directional response may also be shaped based on speech, echo, and noise spectrum and levels. Such a directional response may not be possible using time invariant beamforming. For example, beamforming followed by acoustic echo cancellation, as used in other applications, requires the use of time invariant beamformers, since any time varying process inserted along the echo path would severely degrade the performance of any acoustic echo cancellation algorithm. By using an adaptive beamformer after acoustic echo cancellation, the beamformer may be configured to switch between various beam pattern shapes and arrangements according to user activity and position, and according to the level of residual echo.

By providing an adaptive beamformer at the outputs of the acoustic echo cancellers 50, the beamformer can also be used to reduce residual echo and can be designed to have an upper working limit of between 4 kHz and 8 kHz. The upper working limit may depend on the microphone spacing and may for example be 5 kHz, 6 kHz or 7 kHz. The upper working limit may decrease with increasing distance between the microphones. A sealed enclosure 35 around the sides of the loudspeaker and between the loudspeaker and the microphone will provide acoustic shadowing which may be enough to keep any echo coupling low above 4 kHz to 8 kHz.
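
The dependence of the upper working limit on microphone spacing can be illustrated with the common half-wavelength rule of thumb for avoiding spatial aliasing. The specification does not state which criterion is used, so the rule itself, the assumed speed of sound of 343 m/s and the function names below are all assumptions of this sketch.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: air at roughly 20 degrees Celsius

def upper_working_limit_hz(mic_spacing_m):
    """Frequency at which the microphone spacing equals half a wavelength."""
    return SPEED_OF_SOUND_M_S / (2.0 * mic_spacing_m)

def spacing_for_limit_m(limit_hz):
    """Largest spacing that keeps the half-wavelength limit at or above limit_hz."""
    return SPEED_OF_SOUND_M_S / (2.0 * limit_hz)

limit_for_40mm = upper_working_limit_hz(0.040)   # hypothetical 40 mm spacing
spacing_for_8khz = spacing_for_limit_m(8000.0)
```

Under this rule, spacings of roughly 2 cm to 4 cm correspond to limits of roughly 8 kHz down to 4 kHz, and the limit falls as the spacing grows.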

The audio signal processing performed by the data processing apparatus 40 may be implemented in the short-time Fourier transform (STFT) domain, using a Hamming window with zero padding and Fast Fourier Transform (FFT). The echo cancellation may be performed using a partitioned block frequency domain adaptive filter (PBFDAF) algorithm, or generalised multi-delay filter (GMDF) adaptive filtering.
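
The framing, windowing and zero padding steps mentioned above can be sketched as follows. This is an illustration under stated assumptions: the frame length, hop size and transform length are arbitrary, the symmetric Hamming window definition is one common choice, and a naive discrete Fourier transform stands in for an FFT so that the example needs only the Python standard library. The function names are hypothetical.

```python
import cmath
import math

def hamming(n_samples):
    """Symmetric Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def dft(frame):
    """Naive O(N^2) discrete Fourier transform; an FFT yields the same values faster."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def stft(signal, frame_len=16, hop=8, fft_len=32):
    """Return one spectrum per overlapping, windowed and zero padded frame."""
    window = hamming(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        windowed = [signal[start + n] * window[n] for n in range(frame_len)]
        padded = windowed + [0.0] * (fft_len - frame_len)  # zero padding
        spectra.append(dft(padded))
    return spectra

# Test tone with a period of 8 samples, which falls on bin 4 of a 32 point transform.
signal = [math.cos(2.0 * math.pi * 4 * n / 32) for n in range(64)]
spectra = stft(signal)
```

Each entry of `spectra` holds the complex frequency bins for one frame; per-bin processing such as echo cancellation or beamforming would then operate on these values.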

Once in the STFT domain, the beamforming, and in particular MVDR, may be conveniently implemented. In MVDR, an estimate is made of the covariance matrix of the background noise (or, more generally, interference) for all microphone pairs. The covariance matrix is combined with the steering vector (the phase shift factor resulting from a sound coming from the desired beam pick up direction) in order to get the beam weights that generate a directional pattern with flat frequency response in the look direction and minimise noise at the output. Using an MVDR beamformer it is possible to generate various beams with various look directions (for example four beams with four look directions using four microphones), and use a beam steering algorithm based on spectral distance to pick the "loudest" among all the generated beams. It will be understood that it is possible to extend this sort of method to include speech versus non-speech discrimination in order to make the beamformer focus on just real speech and neglect non-speech noises.
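
The combination of covariance matrix and steering vector described above is commonly written as w = R⁻¹d / (dᴴR⁻¹d). The sketch below computes these weights for a single frequency bin using only the Python standard library. The array geometry, frequency, noise model, sign convention of the steering vector and all function names are assumptions made for illustration; the specification does not give an implementation.

```python
import cmath
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed

def steering_vector(positions, azimuth_deg, freq_hz):
    """Far-field phase factors for a plane wave arriving from azimuth_deg."""
    az = math.radians(azimuth_deg)
    ux, uy = math.cos(az), math.sin(az)  # unit vector pointing towards the source
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND_M_S
    return [cmath.exp(1j * k * (x * ux + y * uy)) for x, y in positions]

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def mvdr_weights(noise_cov, steer):
    """Beam weights w = R^-1 d / (d^H R^-1 d) for one frequency bin."""
    r_inv_d = solve(noise_cov, steer)
    denom = sum(d.conjugate() * v for d, v in zip(steer, r_inv_d))
    return [v / denom for v in r_inv_d]

def beam_response(weights, steer):
    """Array output w^H d for a unit plane wave with steering vector d."""
    return sum(w.conjugate() * d for w, d in zip(weights, steer))

# Hypothetical setup: four microphones on a 40 mm radius circle, 1 kHz bin,
# look direction 0 degrees, one interferer at 90 degrees plus weak sensor noise.
positions = [(0.04, 0.0), (0.0, 0.04), (-0.04, 0.0), (0.0, -0.04)]
d_look = steering_vector(positions, 0.0, 1000.0)
d_int = steering_vector(positions, 90.0, 1000.0)
noise_cov = [[d_int[i] * d_int[j].conjugate() + (0.01 if i == j else 0.0)
              for j in range(4)] for i in range(4)]
weights = mvdr_weights(noise_cov, d_look)
look_gain = beam_response(weights, d_look)
interferer_gain = beam_response(weights, d_int)
```

In this assumed scenario the response in the look direction stays at unity while the response towards the interferer is strongly reduced. Repeating the calculation for several look directions and comparing the beam outputs would give the set of candidate beams from which a steering algorithm selects.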

The data processing apparatus 40 may be of any suitable composition and may include one or more processors of any suitable type or suitable combination of types. For example, the data processing apparatus 40 may comprise a programmable processor that interprets computer program instructions and processes data. Alternatively, the data processing apparatus 40 may comprise, for example, programmable hardware with embedded firmware. A processing apparatus may alternatively or additionally include one or more specialised circuits such as field programmable gate arrays (FPGAs), Application Specific Integrated Circuits (ASICs), signal processing devices etc.

The data processing apparatus may include memory having computer readable instructions stored thereon, which when executed by a processor cause the processor to cause performance of operations and/or methods described herein.

Referring again to Figure 1, the speaker 30 may comprise a number of different frequency speaker drivers. Speaker drivers convert a received electrical audio signal to sound waves. In the example of Figure 1, the speaker 30 comprises a single driver element 31 arranged to direct sound in an upward direction. The speaker 30 may also comprise a cone reflector 32 on top of the driver 31. As such, the sound may be radiated through a ring shaped window. The driver 31 may be a full range driver covering the whole audio frequency range. However, the driver 31 may be a limited frequency driver, and may be combined with drivers covering different frequency ranges, as described in more detail with reference to Figures 3 and 4.

The speaker 30 may be provided in an enclosure 35 sealed around the sides and the base. In this way, the coupling between the sound from the speaker 30 and the microphone array 20 may be reduced by reducing the sound directed towards the microphones 21. In the example of Figure 1, the microphone array 20 is a circular array 20. In this example, the speaker 30 is provided to be coaxial with the microphone array 20.

However, the speaker 30 may be provided at a location other than coaxial with the microphone array 20. The speaker 30 is provided at a plane above the microphone array 20. The speaker 30 and the microphone array 20 are separated by the acoustically transparent housing 10. The acoustically transparent housing 10 allows the microphone array 20 to be exposed to surrounding sound.

A greater physical separation between the speaker 30 and the microphone array 20 may reduce the amount of sound output by the speaker 30 which is received at the microphone array 20. This may help to reduce any echo in the captured audio data. Additionally, a greater physical separation between the speaker 30 and the microphone array 20 may reduce any acoustic shadowing or scattering caused by the speaker enclosure 35 affecting the sound field around the array 20.

In the example of Figure 3, the device 1 may include a speaker 30 arrangement which differs to that depicted in Figure 1. In the example of Figure 3, the speaker 30 includes a low frequency driver 33, also commonly known as a "woofer". The speaker 30 also includes a high frequency driver 34, also commonly known as a "tweeter". The high frequency driver 34 may be provided on top of the low frequency driver 33 such that it is separated from the low frequency driver 33 by a seal. Additionally, the speaker 30 arrangement may be separated from the microphone array 20 by the enclosure 35. A cone reflector 32 may be placed on top of the high frequency driver 34. The cone reflector 32 may help to spread the distribution of the sound. Including two different frequency drivers may improve the frequency response of the speaker 30, but may increase the cost of the speaker 30 compared to the example of Figure 1.

In the example of Figure 3, the speaker 30 is provided such that the drivers 33, 34 are concentric with each other and with the microphone array 20. However, it will be understood that the embodiment is not limited to the drivers 33, 34 being provided in a concentric manner.

Figure 4 illustrates a device 1 including an alternative speaker 30 arrangement to those depicted in Figures 1 and 3. In this example, a low frequency driver 33 is provided above the microphone array 20 similarly to Figure 1, arranged to face an upward direction. The driver 33 may be provided to be concentric with the microphone array 20. The speaker 30 may further comprise a plurality of high frequency drivers 34 directed radially outward. Including a plurality of high frequency drivers 34 may further improve the frequency response of the speaker 30. However, a greater number of high frequency drivers 34 may increase the complexity and cost of the device 1.

Including a plurality of high frequency drivers 34 may provide for spatial rendering of the high frequency components of a multichannel output audio signal. In spatial audio rendering, the output audio signal is rendered such that given sounds may be perceived to come from a given spatial location. By performing acoustic echo cancellation on audio data derived from each microphone 21, spatial rendering of the output audio signal may be performed, as the system does not require equal phase and coupling factors of the sound received at each microphone 21. As higher frequencies may be more relevant for spatial rendering, only one low frequency driver may be required in combination with the plurality of high frequency drivers 34.

Figure 5 depicts an example in which the device 1 may include a speaker 30 arrangement which differs to those depicted in Figures 1, 3 and 4. In the example of Figure 5, the speaker 30 is arranged to project sound in a direction substantially perpendicular to the upward facing direction of the speaker 30 described with reference to Figure 1. For example, when placed on a surface in a room, the device 1 may be arranged such that the speaker 30 is directed towards a user, as desired. This may improve the experience of the user receiving the sound from the speaker 30. The speaker 30 may be provided in this way, for example, for use with a video communication device where the user is expected to stay inside an area covered by a camera pointing in a fixed direction, and so the speaker 30 may be configured to be pointing towards the same area.

Since the acoustic echo cancellation and beamforming used in these embodiments do not require the speaker to be facing vertically upwards and concentric with the microphone array, the speaker may be provided to face a desired direction without being concentric with the microphone array.

The echo cancellation and beamforming performed on the captured audio signals allow the speaker to be provided according to any of the embodiments described above. Indeed, it will be understood that the speaker may be positioned in any location above the microphone array and facing any direction. In addition, the embodiments described allow for a large volume enclosure to be provided for the speaker, and the speaker may include multiple drivers in order to improve the frequency response, or to provide spatial audio.

Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims and not solely the combinations explicitly set out in the claims.

It is noted herein that while the above describes various examples, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present invention as defined in the appended claims.