Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
DEVICE AND METHOD FOR RENDERING A BINAURAL AUDIO SIGNAL
Document Type and Number:
WIPO Patent Application WO/2020/221431
Kind Code:
A1
Abstract:
The invention relates to the technical field of 3D sound, for instance, for virtual reality (VR) applications or surround sound. The invention proposes in particular a device and a method for rendering a binaural audio signal. The device is configured to obtain a direct component of the audio signal and a diffuse component and a Direction of Arrival (DoA) from a plurality of audio channels of the audio signal. Further, the device is configured to determine a head-related transfer function (HRTF) according to the DoA. Finally, the device is configured to filter the direct component based on the HRTF to obtain a modified direct component, and to generate the binaural audio signal based on the diffuse component and the modified direct component.

Inventors:
POLLOW MARTIN (DE)
FALLER CHRISTOF (CH)
FAVROT ALEXIS (CH)
TAGHIZADEH MOHAMMAD (DE)
Application Number:
PCT/EP2019/060996
Publication Date:
November 05, 2020
Filing Date:
April 30, 2019
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
HUAWEI TECH CO LTD (CN)
POLLOW MARTIN (DE)
International Classes:
H04S7/00
Foreign References:
EP2942981A12015-11-11
EP2249334A12010-11-10
US20030035553A12003-02-20
Other References:
None
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Download PDF:
Claims:
CLAIMS

1. A device (100) for rendering a binaural audio signal (107), wherein the device (100) is configured to:

obtain a direct component (102) of the audio signal (101) and a diffuse component (103) and a Direction of Arrival, DoA, (104) from a plurality of audio channels of the audio signal (101),

determine a head-related transfer function, HRTF, (105) according to the DoA (104), filter the direct component (102) based on the HRTF (105) to obtain a modified direct component (106), and

generate the binaural audio signal (107) based on the diffuse component (103) and the modified direct component (106).

2. The device (100) according to claim 1, wherein:

the plurality of audio channels are B-format channels.

3. The device (100) according to claim 1 or 2, configured to:

generate the binaural audio signal (107) by combining the diffuse component (103) with the modified direct component (106).

4. The device (100) according to one of the claims 1 to 3 configured to:

generate a left diffuse component (203L) and a right diffuse component (203R) based on the diffuse component (103),

generate a left modified direct component (106L) and a right modified direct component (106R) based on the modified direct component (106), and

combine the left diffuse component (203L) with the left modified direct component (106L), and combine the right diffuse component (203R) with the right modified direct component (106R), wherein the binaural audio signal (107) is generated based on the result of the combining.

5. The device (100) according to claim 4, configured to:

generate the left modified direct component (106L) and the right modified direct component (106R) by applying the HRTF (105) to the direct component (102).

6. The device (100) according to claim 4 or 5, configured to:

generate the left diffuse component (203L) and the right diffuse component (203R) by linear decoding of the diffuse component (103).

7. The device (100) according to one of the claims 1 to 6, configured to:

estimate a diffuseness (200) of the audio signal (101), and

obtain the direct component (102) and optionally the diffuse component (103) based on the estimated diffuseness (200).

8. The device (100) according to claim 7, configured to:

estimate the diffuseness based on a short-time Fourier transform (201) of the audio signal (101).

9. The device (100) according to claim 8, configured to:

estimate the DoA (104) based on a short-time Fourier transform of the audio channels.

10. The device (100) according to one of the claims 1 to 9, wherein:

the HRTF (105) includes a gain and a phase to be applied to each time- frequency tile of the direct component (102).

11. The device (100) according to one of the claims 1 to 10, configured to:

smooth the HRTF (104) over time to obtain a smoothed HRTF, and

modify the direct component (102) based on the smoothed HRTF.

12. The device (100) according to one of the claims 1 to 11, wherein:

the audio signal (101) is an Ambisonic signal.

13. A headphone device, wherein the headphone device comprises a device (100) for rendering a binaural audio signal (107) according to one of the claims 1 to 12.

14. A method (300) for rendering a binaural audio signal (107), wherein the method (300) comprises: obtaining (301) a direct component (102) of the audio signal (101) and a diffuse component (103) and a Direction of Arrival, DoA, (104) from a plurality of audio channels of the audio signal (101),

determining (302) a head-related transfer function, HRTF, (104) according to the DoA (105),

filtering (303) the direct component (102) based on the HRTF (104) to obtain a modified direct component (106), and

generating (304) the binaural audio signal (107) based on the diffuse component (103) and the modified direct component (106).

15. A computer program product comprising a program code for controlling a device (100) according to any one of the claims 1 to 12, or for controlling a headphone device according to claim 13, or for carrying out, when executed by a processor, the method (300) according to claim 14.

Description:
DEVICE AND METHOD FOR RENDERING A BINAURAL AUDIO SIGNAL

TECHNICAL FIELD

The present invention relates to the technical field of three-dimensionality (3D) sound, for instance, for virtual reality (VR) applications or surround sound. The invention also relates to VR compatible audio formats, e.g. First Order Ambisonic (FOA) signals (also referred to as B- format). The invention relates specifically to generating binaural sounds/signals from such audio formats. The invention proposes to this end a device and a method for rendering a binaural signal.

BACKGROUND

3D VR sounds are typically recorded and stored as FOA signals. The rendering of these FOA signals over headphones is then done by converting them into binaural sounds. The binaural sounds are obtained based on Head Related Transfer Functions (HRTFs), which model the filters from a point source emitting to the ears, such that the impressions of immersion and extemalisation are improved.

In particular, the binaural sounds are usually rendered by applying the HRTFs to decoded virtual loudspeaker signals. For instance, a straightforward approach for obtaining the binaural sounds is, to decode the FOA signals into specific loudspeaker setups with pre-defined positions, and then applying the HRTFs relatively to these positions. However, direct and linear decoding of the FOA signals does not provide enough spatial resolution to cover the entire 3D space. Moreover, the performance is often restricted by a trade-off between the computational complexity of the system and the precision (length) of the HRTFs models or measurements.

Non-linear decoding provides better localisation and spatialization, or better discrimination between direct and diffuse sounds for improved spaciousness. In particular, parametric approaches in the time- frequency domain, based on non-linear decoding into N virtual loudspeaker signals, provide such better results. However, it is still necessary to apply the HRTFs to virtual loudspeaker signals by convolutions. In particular, they are applied to the N parametrically generated virtual loudspeaker signals. Other approaches directly synthesise a two channel binaural signal using target binaural cues, which are computed either based on an analysis or based on MPEG surround spatial cues.

SUMMARY

In view of the above, embodiments of the invention aim to improve the current approaches for rendering binaural sounds. An objective is to obtain a binaural audio signal with improved spatial resolution and by less computational complexity. In particular, the use of virtual loudspeaker signals may be avoided. Further, binaural cues should not be necessary for the HRTFs. In addition, full compatibility with existing FOA signals is desired.

The objective is achieved by the embodiments of the invention as described in the enclosed independent claims. Advantageous implementations of the embodiments of the invention are further defined in the dependent claims.

In particular, embodiments of the invention propose separating a direct component from a diffuse component of an audio signal, then modifying the direct component based on a HRTF, which is determined based on a Direction of Arrival (DoA) related to the audio signal, and then rendering the binaural audio signal from the diffuse component and the modified direct component.

A first aspect of the invention provides a device for rendering a binaural audio signal, wherein the device is configured to: obtain a direct component of the audio signal and a diffuse component and a DoA, from a plurality of audio channels of the audio signal, determine a HRTF according to the DoA, filter the direct component based on the HRTF to obtain a modified direct component, and generate the binaural audio signal based on the diffuse component and the modified direct component.

The device of the first aspect is able to render the binaural audio signal with high spatial resolution and low computational complexity. Thereby, the use of virtual loudspeaker signals is not necessary, and also not binaural cues need to be used. The HRTF can be applied directly to the audio signal (direct component). Moreover, the device of the first aspect is full compatibility with existing audio signals, particularly FOA signals. In an implementation form of the first aspect, the plurality of audio channels are B-format channels.

That is, the audio channels may be the channels of a FOA signal (i.e. W, X, Y and Z), and the audio signal may be the FOA signal.

In an implementation form of the first aspect, the device is configured to: generate the binaural audio signal by combining the diffuse component with the modified component.

In an implementation form of the first aspect, the device is configured to: generate a left diffuse component and a right diffuse component based on the diffuse component, generate a left modified direct component and a right modified direct component based on the modified direct component, and combine the left diffuse component with the left modified direct component, and combine the right diffuse component with the right modified direct component, wherein the binaural audio signal is generated based on the result of the combining.

In this way the binaural audio signal can be rendered optimally, for instance, for a headphone device or VR device.

In an implementation form of the first aspect, the device is configured to: generate the left modified direct component and the right modified direct component by applying the HRTF to the direct component.

Applying the HRTF only to the direct component, in order to modify the direct component, ensures low computational complexity.

In an implementation form of the first aspect, the device is configured to: generate the left diffuse component and the right diffuse component by linear decoding of the diffuse component.

This provides a simple way to obtain the left/right diffuse components, with low computational complexity. In an implementation form of the first aspect, the device is configured to: estimate a diffuseness of the audio signal, and obtain the direct component and optionally the diffuse component based on the estimated diffuseness.

In an implementation form of the first aspect, the device is configured to: estimate the diffuseness based on a short-time Fourier transform of the audio signal.

In an implementation form of the first aspect, the device is configured to: estimate the Do A based on a short-time Fourier transform of the audio channels.

In an implementation form of the first aspect, the HRTF includes a gain and a phase to be applied to each time- frequency tile of the direct component.

In an implementation form of the first aspect, the device is configured to: smooth the HRTF over time to obtain a smoothed HRTF, and modify the direct component based on the smoothed HRTF.

In this way, audible audio artefacts can be reduced.

In an implementation form of the first aspect, the audio signal is an Ambisonic signal.

In particular, the audio signal may be a FOA signal.

A second aspect of the invention provides a headphone device, wherein the headphone device comprises a device for rendering a binaural audio signal according to the first aspect or any of its implementation forms.

The headphone device of the second aspect may particularly be for a 3D VR system. For instance, it may be included in such a system. The headphone device of the second aspect enjoys all advantages of the device of the first aspect.

A third aspect of the invention provides a method for rendering a binaural audio signal, wherein the method comprises: obtaining a direct component of the audio signal and a diffuse component and a DoA from a plurality of audio channels of the audio signal, determining a HRTF according to the DoA, filtering the direct component based on the HRTF to obtain a modified direct component, and generating the binaural audio signal based on the diffuse component and the modified direct component.

In an implementation form of the third aspect, the plurality of audio channels are B-format channels.

In an implementation form of the third aspect, the method comprises: generating the binaural audio signal by combining the diffuse component with the modified component.

In an implementation form of the third aspect, the method comprises: generating a left diffuse component and a right diffuse component based on the diffuse component, generating a left modified direct component and a right modified direct component based on the modified direct component, and combining the left diffuse component with the left modified direct component, and combining the right diffuse component with the right modified direct component, wherein the binaural audio signal is generated based on the result of the combining.

In an implementation form of the third aspect, the method comprises: generating the left modified direct component and the right modified direct component by applying the HRTF to the direct component.

In an implementation form of the third aspect, the method comprises: generating the left diffuse component and the right diffuse component by linear decoding of the diffuse component.

In an implementation form of the third aspect, the method comprises: estimating a diffuseness of the audio signal, and obtaining the direct component and optionally the diffuse component based on the estimated diffuseness.

In an implementation form of the third aspect, the method comprises: estimating the diffuseness based on a short-time Fourier transform of the audio signal.

In an implementation form of the third aspect, the method comprises: estimating the DoA based on a short-time Fourier transform of the audio channels. In an implementation form of the third aspect, the HRTF includes a gain and a phase to be applied to each time- frequency tile of the direct component.

In an implementation form of the third aspect, the method comprises: smoothing the HRTF over time to obtain a smoothed HRTF, and modifying the direct component based on the smoothed HRTF.

In an implementation form of the third aspect, the audio signal is an Ambisonic signal.

The method of the third aspect and its implementation forms achieve all advantages of the device of the first aspect and its respective implementation forms.

A fourth aspect of the invention provides a computer program product comprising a program code for controlling a device according to the first aspect or any of its implementation forms, or for controlling a headphone device according to the second aspect, or for carrying out, when executed by a processor, the method according to the third aspect.

It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.

BRIEF DESCRIPTION OF DRAWINGS

The above described aspects and implementation forms of the invention will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which

FIG. 1 shows a device according to an embodiment of the invention. FIG. 2 shows a device according to an embodiment of the invention.

FIG. 3 shows a method according to an embodiment of the invention.

DETAILED DESCRIPTION OF EMBODIMENTS

FIG. 1 shows a device 100 according to an embodiment of the invention. The device 100 is configured to render a binaural audio signal 107, particularly from an audio signal 101 comprising multiple audio channels, e.g. an FOA signal. The device 100 may be used in a headphone or VR device.

The device 100 may comprise processing circuitry (not shown) configured to perform, conduct or initiate the various operations of the device 100 described herein. The processing circuitry may comprise hardware and software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. In one embodiment, the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors. The non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the device 100 to perform, conduct or initiate the operations or methods described herein.

The device 100 is specifically configured to obtain a direct component 102 of the audio signal 101 and a diffuse component 103 and a DoA 104 from a plurality of audio channels of the audio signal 101. That is, the device 100 may be configured to separate the audio signal 101 into the direct component 102 and the diffuse component 103, and to derive the DoA 104 related to the audio signal 101. The audio signal 101 be a FOA signal or B-format signal. The plurality of audio channels may be B-format channels.

The device 100 is further configured to determine a HRTF 105 according to the DoA 104. To this end, it may be configured to first estimate the DoA 104, for instance, based on a short-time Fourier transform of the audio channels of the audio signal 101. Then, it may use the DoA 104 to generate the HRTF 105. The device 100 is further configured to filter the direct component 102 based on the HRTF 105 (e.g. applying the HRTF 105 to the direct component 102), in order to obtain a modified direct component 106, and then to generate the binaural audio signal 107 based on the diffuse component 103 and the modified direct component 106. In particular, the device 100 may generate the binaural audio signal 107 by combining the diffuse component 103 with the modified direct component 106.

FIG. 2 shows the device 100 of FIG. 1 according to an embodiment of the invention in more detail, particularly with further optional features. The device 100 shown in FIG. 2 can generate a left diffuse component 203L and a right diffuse component 203R based on the diffuse component 103, and can generate a left modified direct component 106L and a right modified direct component 106R based on the modified direct component 106. Then, in order to generate the binaural audio signal 107, the device 100 can combine in particular the left diffuse component 203L with the left modified direct component 106L, and also combine the right diffuse component 203R with the right modified direct component 106R. The binaural audio signal 107 is thus generated based on the result of the combining.

Considering specifically, but as an example only, a FOA signal as the audio signal 101, which is to be parametrically decoded into binaural sound (i.e. into the binaural audio signal 107), the device 100 may first estimate the DoA 104 of the FOA signal 101 (e.g. azimuth and elevation angles), as well as the directness, respectively, the diffuseness 200 of the FOA signal 101.

Then, the direct component 102 and the diffuse component 103 of the FOA signal 101 may be separated. The diffuse component 103 may be decoded linearly into the left and right diffuse components 203L and 203R. Meanwhile, HRTF processing may be applied to the direct component 102, in order to obtain the left and right direct components 106L and 106R, respectively. Eventually both the direct and diffuse components may be combined to obtain the binaural audio signal 107 (including binaural audio channel, e.g. left and right).

The device 100 does not rely on numerous decoded virtual signals, but the HRTF 105 can be applied directly to the FOA signal 101. Moreover, the HRTF 105 can be integrated in the parametrical model and can take advantage of FOA analysis, i.e. can be fully adaptively computed from the DOA estimate. Thus, the device 100 and the algorithm it performs are compact and computational efficient, since all operations may be done independently for each time- frequency tile. An exemplary specific algorithm, which the device 100 may perform, is described in the following.

Considering the complex spectra W, X, Y and Z of the FOA signal 101, which may be obtained running a N STFT points short-time Fourier transform (STFT 201 in FIG. 2), a DOA analysis can be performed according to:

are, respectively, the corresponding azimuth and elevation angles estimated at frequency bin i and time frame k,“Re{...}” means“The real part of ...”. Simultaneously, a directness estimation can be performed based on the same spectra of the FOA signal 101, according to:

The directness estimate (3) can then be used to separate the direct component 102 from the diffuse component of the FOA signal 101, according to:

and Thereby, g stands for an exponent design parameter, which is typically chosen such that g=0.5.

The left and right diffuse components 203L and 203R (L diff and R diff in FIG. 2) may then be obtained by linear decoding of the FOA diffuse component 102 obtained in (5). For instance, left and right channels can be rendered through two decoded cardioid signals with maximum angle separation, i.e. according to:

In this way, the left and right rendered diffuse components 203L and 203R will benefit from the best possible de-correlation. Moreover, the decoding can possibly be made frequency dependent, in order to follow the physical shape of the cross-correlation coefficient between both ear signals.

The left and right modified direct components 106L and 106R (L dir and R dir in FIG. 2) may be obtained by applying the HRTF processing based on the HRTF 105 to the FOA direct components 102 obtained in (4). A simple HRTF 105 model can be directly derived from the DO A estimation given in (1) and (2), wherein the inter-aural level differences (ILDs) between both ears may be derived given the first order filter according to:

Thereby, the coefficient a + and b may be derived from: a + (q,F) = 1 + sin q cos F, (8) and K is here the diameter of the head of the HRTF model and c being the speed of sound. Considering the magnitude of the filter H at a given frequency f the left and right HRTF gains are of the HRTF model can be obtained as:

and

Thereby, a-(q,F) = 1 - sin q cos F .he inter-aural time differences (ITDs) between both ears can be derived assuming plane waves propagation:

Given the ILDs and ITDs derived in (10)-(13), the HRTF model may simply comprise or consists of a gain and phase to be apply on each time- frequency tiles of the direct sound contributions of the FOA signal 101 :

with, by construction, The derived HRTF 105 parameters may be smoothed over time to reduce audible audio artefacts, e.g. according to:

Thereby, a HRTF may be determined by:

T HRTF is the averaging time-constant in seconds and f s is the spectrum sampling frequency. and may be obtained analogously.

The left and right direct components 106L and 106R, respectively, i.e. the direct sound contributions to the binaural audio signal 107, may be obtained by applying the previously derived filters to the omnidirectional direct signal according to:

The direct components 106L and 106R may also be obtained by applying the previously derived filters to combinations of B-Format channels, e.g. according to:

In this, case left and right oriented sub-cardioid B-Format linear combinations may be used. may be adjusted to compensate for a gain difference between and the left and

right sub-cardioid, respectively.

Finally, the final binaural signal 107 may be reconstructed by adding the direct and diffuses sound contributions.

L L dir + L diff, (21) and

R R dir + R diff . (22)

FIG. 3 shows a method 300 according to an embodiment of the invention. The method 300 is for rendering a binaural signal 107, and can be performed by the device 100 as shown in FIG. 1 or FIG. 2.

The method 300 comprises: a step 301 of obtaining a direct component 102 of the audio signal 101 and a diffuse component 103 and a DoA 104 from a plurality of audio channels of the audio signal 101; a step 302 of determining a HRTF 104 according to the DoA 105; a step 303 of filtering the direct component 102 based on the HRTF 104 to obtain a modified direct component 106; and a step 304 of generating the binaural audio signal 107 based on the diffuse component 103 and the modified direct component 106.

The device 100, method 300, and the detailed algorithm can have the following advantages:

• There may be no need for virtual loudspeaker signals. Instead, the direct component 102 and the diffuse component 103 (particularly the left/right diffuse components 203L/203R) may be separately obtained from the single FOA channels (i.e. the FOA signal 101).

• The HRTF 105 may not require target binaural cues. Instead, the two binaural channels can be synthesized based on:

The DOA for the direct sound.

Purely (frequency dependent) linear rendering of the diffuse sound.

• The device 100, method 300, and algorithm may be compatible with existing FOA signals.

The invention has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed invention, from the studies of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word “comprising” does not exclude other elements or steps and the indefinite article“a” or“an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in the mutual different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.