Title:
EFFICIENT TIME DELAY SYNTHESIS
Document Type and Number:
WIPO Patent Application WO/2024/100110
Kind Code:
A1
Abstract:
Techniques are provided for adjusting the timing of output audio signals to achieve a desired inter-channel time difference (ITD) between the output audio signals. A method comprises receiving a current ITD value and an audio frame, and determining transition times t1, t2 for a time shift to apply to at least one of a first output signal and a second output signal based on the ITD of a current frame and an ITD of a previous frame. The time shift within the determined transition times t1, t2 is applied in generation of the first output signal and the second output signal.

Inventors:
NORVELL ERIK (SE)
Application Number:
PCT/EP2023/081137
Publication Date:
May 16, 2024
Filing Date:
November 08, 2023
Assignee:
ERICSSON TELEFON AB L M (SE)
International Classes:
H04S1/00; G10L19/008
Foreign References:
US 2020/0082834 A1, 2020-03-12
Other References:
CUEVAS-RODRÍGUEZ MARÍA ET AL: "3D Tune-In Toolkit: An open-source library for real-time binaural spatialisation", PLOS ONE, vol. 14, no. 3, 11 March 2019 (2019-03-11), pages 1 - 37, XP055798563, Retrieved from the Internet DOI: 10.1371/journal.pone.0211899
JEAN-MARC JOT ET AL: "Binaural Simulation of Complex Acoustic Scenes for Interactive Audio", 15TH INTERNATIONAL CONFERENCE: AUDIO, ACOUSTICS & SMALL SPACES, AUDIO ENGINEERING SOCIETY, US, vol. 121, 1 January 2006 (2006-01-01), pages 1 - 20, XP007905995
LASSE LAAKSONEN ET AL: "DRAFT TS 26.253 (Codec for Immersive Voice and Audio Services, Detailed Algorithmic Description incl. RTP payload format and SDP parameter definitions)", 3GPP DRAFT; S4-231842; TYPE DISCUSSION, 3RD GENERATION PARTNERSHIP PROJECT (3GPP), MOBILE COMPETENCE CENTRE ; 650, ROUTE DES LUCIOLES ; F-06921 SOPHIA-ANTIPOLIS CEDEX ; FRANCE, vol. 3GPP SA 4, no. Chicago, US; 20231113 - 20231117, 7 November 2023 (2023-11-07), XP052546126, Retrieved from the Internet [retrieved on 20231107]
Attorney, Agent or Firm:
ERICSSON (SE)
Claims:
CLAIMS

1. A method for adjusting timing of output audio signals to achieve a desired inter-channel time difference, ITD, between output audio signals, the method comprising: receiving (401) a current ITD value and an audio frame; determining (405) transition times t1, t2 to perform a time shift to apply to at least one of a first output signal and a second output signal based on the ITD of a current frame and an ITD of a previous frame; and applying (407) the time shift within the determined transition times t1, t2 in generation of the first output signal and the second output signal.

2. The method of claim 1, wherein at least a part of the audio frame is stored (403) in memory to be used in synthesizing ITD in the following frame.

3. The method of any of claims 1-2, wherein the audio frames are part of an audio object comprising an audio signal and position metadata describing an object position, the method further comprising obtaining the time shift from the position metadata.

4. The method of any of claims 1-3, further comprising: computing (801) a total transition length based on a frame length N of the current frame m and a lookahead memory d_LA required for resampling, the total transition length being divided into two parts comprising t1 and t2; determining (803) a transition length t3 based on the lookahead memory d_LA; and determining (805) the total transition length based on a maximum allowed transition length and the transition length t3.

5. The method of claim 4, wherein determining the transition length t3 comprises determining the transition length t3 according to: t3 = max(0, d_LA − |ITD(m − 1)|), where ITD(m − 1) is the ITD of the previous audio frame comprising N samples, and determining the total transition length T_tot according to: T_tot = T_max − t3, wherein T_max ≤ N, wherein T_max is the maximum allowed transition length.

6. The method of any of claims 1-5, wherein determining the transition times t1, t2 comprises: responsive to the sign of the current ITD and the previous ITD being the same, assigning the total transition length to one of the transition times t1, t2 and setting the other one to zero.

7. The method of any of claims 1-5, wherein determining the transition times t1, t2 comprises: responsive to the sign of the current ITD and a sign of the previous ITD being different, applying a shift operation on both the first output signal and the second output signal by splitting the total transition length into two parts to determine the transition times t1, t2.

8. The method of claim 7, wherein splitting the total transition length into two parts to determine the transition times t1, t2 comprises splitting the total transition length according to: t1 = [T_tot · |ITD(m − 1)| / (|ITD(m − 1)| + |ITD(m)|)] and t2 = T_tot − t1, where ITD(m) is the current ITD and [] represents a rounding operation to the nearest integer.

9. The method of any of claims 4-8, further comprising: populating a processing buffer (510) using the audio frame; wherein applying the determined transition times t1, t2 in generation of the first output signal and the second output signal comprises: responsive to the sign of the current ITD and the sign of the previous ITD being the same, or responsive to one of the current ITD and the previous ITD being zero: adjusting the processing buffer (510) to populate a first output buffer (550) by assigning the total transition length to t1 and setting t2 to zero; and copying an input signal part of the processing buffer (510) to a second output buffer (560).

10. The method of claim 9, wherein applying the determined transition times t1, t2 in generation of the first output signal and the second output signal further comprises: responsive to ITD(m) = ITD(m − 1) and a sign of one of the current ITD and the previous ITD being negative, thereby indicating that one of the first output signal and the second output signal is ahead of the other of the first output signal and the second output signal, delaying an output buffer comprising whichever one of the first output buffer (550) or the second output buffer (560) is associated with the other of the first output signal and the second output signal by the total transition length.

11. The method of any of claims 9-10, wherein applying the determined transition times t1, t2 in generation of the first output signal and the second output signal further comprises: responsive to |ITD(m)| > |ITD(m − 1)|, generating a transition by: extending the length of the frame in the processing buffer (510) from length t1 + |ITD(m − 1)| − |ITD(m)| to an output frame of length t1; and responsive to the transition length t3 being larger than zero, adding the last t3 samples of the output channel by copying from the processing buffer (510).

12. The method of any of claims 9-11, wherein applying the determined transition times t1, t2 in generation of the first output signal and the second output signal further comprises: responsive to |ITD(m)| < |ITD(m − 1)|, adding the last t3 samples of the first output buffer (550) by copying from the processing buffer (510).
13. The method of any of claims 9-12, wherein applying the determined transition times t1, t2 in generation of the first output signal and the second output signal further comprises: responsive to ITD(m) ∙ ITD(m − 1) < 0, splitting the total transition length according to: t1 = [T_tot · |ITD(m − 1)| / (|ITD(m − 1)| + |ITD(m)|)] and t2 = T_tot − t1, where [] represents a rounding operation to a nearest integer, wherein splitting the total transition length comprises: resampling samples n = −|ITD(m − 1)|, …, t1 − 1 of length t1 + |ITD(m − 1)| to fit into samples n = 0, …, t1 − 1 of the first output buffer (550); copying samples n = 0, …, t1 − 1 to corresponding indices in the second output buffer (560); and resampling samples x_proc(n), n = t1, …, t1 + t2 − 1 − |ITD(m)| of length t2 − |ITD(m)| to fit into the samples n = t1, …, t1 + t2 − 1 of the length t2 in the second output buffer (560).

14. The method of any of claims 9-13, wherein generation of the second output signal further comprises: responsive to the transition length t3 being larger than zero, adding the last t3 samples of the output channel by copying from the processing buffer (510): x_proc(n), n = N − 1 − t3 − |ITD(m)|, …, N − 1 − |ITD(m)|.

15. The method of any of claims 9-14, wherein applying the determined transition times t1, t2 in generation of the first output signal and the second output signal further comprises: responsive to ITD(m − 1) = 0 and ITD(m) > 0, assigning the first output buffer (550) to the second output signal and the second output buffer (560) to the first output signal; responsive to ITD(m − 1) = 0 and ITD(m) ≤ 0, assigning the first output buffer (550) to the first output signal and the second output buffer (560) to the second output signal; responsive to ITD(m − 1) > 0, assigning the first output buffer (550) to the second output signal and the second output buffer (560) to the first output signal; and responsive to ITD(m − 1) < 0, assigning the first output buffer (550) to the first output signal and the second output buffer (560) to the second output signal.

16. An apparatus (112, 300, 1502) for adjusting timing of output audio signals to achieve a desired inter-channel time difference, ITD, between output audio signals, the apparatus being adapted to: receive a current ITD value and an audio frame; determine transition times t1, t2 to perform a time shift to apply to at least one of a first output signal and a second output signal based on the ITD of a current frame and an ITD of a previous frame; and apply the time shift within the determined transition times t1, t2 in generation of the first output signal and the second output signal.

17. The apparatus (112, 300, 1502) of claim 16, wherein the apparatus is further adapted to perform the method according to any of claims 2-15.

18. An apparatus (112, 300, 1502) comprising: processing circuitry (1202); and memory (1210) coupled with the processing circuitry, wherein the memory includes instructions that when executed by the processing circuitry cause the apparatus to perform operations comprising: receiving a current inter-channel time difference, ITD, value and an audio frame; determining transition times t1, t2 to perform a time shift to apply to at least one of a first output signal and a second output signal based on the ITD of a current frame and an ITD of a previous frame; and applying the time shift within the determined transition times t1, t2 in generation of the first output signal and the second output signal.

19. The apparatus (112, 300, 1502) of claim 18, wherein the memory includes further instructions that when executed by the processing circuitry cause the apparatus to perform operations according to any of claims 2-15.

20. A computer program comprising program code to be executed by processing circuitry (1202) of an apparatus (112, 300, 1502), whereby execution of the program code causes the apparatus to perform operations comprising: receiving a current ITD value and an audio frame; determining transition times t1, t2 to perform a time shift to apply to at least one of a first output signal and a second output signal based on the ITD of a current frame and an ITD of a previous frame; and applying the time shift within the determined transition times t1, t2 in generation of the first output signal and the second output signal.

21. The computer program of claim 20 comprising further program code, whereby execution of the program code causes the apparatus (112, 300, 1502) to perform according to any of claims 2-15.

22. A computer program product comprising a non-transitory storage medium including program code to be executed by processing circuitry (1202) of an apparatus (112, 300, 1502), whereby execution of the program code causes the apparatus (112, 300, 1502) to perform operations comprising: receiving a current ITD value and an audio frame; determining transition times t1, t2 to perform a time shift to apply to at least one of a first output signal and a second output signal based on the ITD of a current frame and an ITD of a previous frame; and applying the time shift within the determined transition times t1, t2 in generation of the first output signal and the second output signal.

23. The computer program product of claim 22, wherein the non-transitory storage medium includes further program code, whereby execution of the program code causes the apparatus (112, 300, 1502) to perform according to any of claims 2-15.

Description:
EFFICIENT TIME DELAY SYNTHESIS

TECHNICAL FIELD

[0001] The present disclosure relates generally to communications, and more particularly to communication methods and related devices and nodes supporting audio encoding and decoding.

BACKGROUND

[0002] Spatial audio is a description of a sound field that immerses a listener. There are several formats of spatial audio. The most common one is the stereo format, where the sound field is rendered through either two speakers or a set of headphones. In scenarios where the playback is on a larger set of loudspeakers, such as 5.1, 7.1+4 or 22.2, spatial audio is often referred to as multichannel audio. There are also spatial audio formats that do not depend on the layout of the loudspeaker system, but rather describe the sound field itself. Such descriptions include Wave Field Synthesis (WFS), where the sound field is captured by an array of microphones and symmetrically reproduced by an array of loudspeakers. Another popular format is Ambisonics, which relies on spherical harmonics captured with a compact microphone array. Ambisonics has become more popular recently, since it is suitable for listener-centric rendering such as Virtual Reality (VR) and Augmented Reality (AR) audio rendering, and it is inherently suitable for rotation. It may also be coupled with a 360-degree video capture for reconstruction of an experienced scene.

[0003] The multichannel audio formats may be played back directly on the loudspeaker setup that they are designed for. However, if a matching loudspeaker configuration is not available, the audio cannot be played back without adaptation. This adaptation is often referred to as rendering the spatial audio for the playback system. If one has a 22.2 multichannel signal or an Ambisonics signal, it may for instance be rendered for playback on a 5.1 system or a set of headphones. When rendering for headphones, the audio that reaches the ears is typically modeled using Head Related Filters (HRF) or Head Related Transfer Functions (HRTF). The filters model the direction of arrival (DoA) of a sound source, such that the listener perceives the sound coming from this direction. This is achieved by a coloration of the spectrum, a level difference between the ears, and a time difference caused by the difference in length of the path to the left and right ears. This time difference is often referred to as an inter-aural time difference, or an inter-channel time difference (ITD). A time difference between the channels may be created by filtering one or both channels with a Dirac pulse: h(n) = δ(n − n_d), where n_d is the time shift in samples. However, the transition between different time shifts, for instance for a moving source, needs to be handled.

SUMMARY

[0004] When modeling the HRF, the spectral coloration may be done using a filter, and the time difference may be generated by time shifts. The present disclosure applies the time shifts in an efficient way when crossing the zero boundary for the shift.

[0005] When the sign of the time delay parameter changes, a shift operation needs to be performed on both output channels. To limit the complexity of the shift operation, the total transition length is shared between the channels. The sharing is done proportionally to the size of the shift on each side of the zero point.

[0006] According to a first aspect there is presented a method for adjusting timing of output audio signals to achieve a desired inter-channel time difference (ITD) between output audio signals.
The method comprises receiving a current ITD value and an audio frame, and determining transition times t1, t2 to perform a time shift to apply to at least one of a first output signal and a second output signal based on the ITD of a current frame and an ITD of a previous frame. The time shift within the determined transition times t1, t2 is applied in generation of the first output signal and the second output signal. The method further comprises storing at least part of the audio frame to be used in synthesizing ITD in the following frame.

[0007] According to a second aspect there is presented an apparatus for adjusting timing of output audio signals to achieve a desired inter-channel time difference, ITD, between output audio signals. The apparatus is adapted to receive a current ITD value and an audio frame, and to determine transition times t1, t2 to perform a time shift to apply to at least one of a first output signal and a second output signal based on the ITD of a current frame and an ITD of a previous frame. The apparatus is adapted to apply the time shift within the determined transition times t1, t2 in generation of the first output signal and the second output signal.

[0008] According to a third aspect there is presented an apparatus comprising processing circuitry and memory coupled with the processing circuitry, wherein the memory includes instructions that when executed by the processing circuitry cause the apparatus to perform operations comprising: receiving a current inter-channel time difference, ITD, value and an audio frame, determining transition times t1, t2 to perform a time shift to apply to at least one of a first output signal and a second output signal based on the ITD of a current frame and an ITD of a previous frame, and applying the time shift within the determined transition times t1, t2 in generation of the first output signal and the second output signal.

[0009] According to a fourth aspect there is presented a computer program comprising program code to be executed by processing circuitry of an apparatus, whereby execution of the program code causes the apparatus to perform operations of the first aspect.

[0010] According to a fifth aspect there is presented a computer program product comprising a non-transitory storage medium including program code to be executed by processing circuitry of an apparatus, whereby execution of the program code causes the apparatus to perform operations of the first aspect.

[0011] Certain embodiments may provide one or more of the following technical advantages. The speed of the adjustment is kept consistent for switches across the zero boundary, and the computational complexity is kept low. The method aims to produce two channels with a time delay, where the time delay may be updated each frame. The updates to the time delay can be done with a minimum of transition artefacts.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate certain non-limiting embodiments of inventive concepts.
In the drawings:

[0013] Figure 1 is a block diagram illustrating an environment in which various embodiments of the present disclosure may be implemented;

[0014] Figure 2 is a block diagram of an audio object renderer according to some embodiments of the present disclosure;

[0015] Figure 3 is a block diagram of a parametric stereo decoder according to some embodiments of the present disclosure;

[0016] Figure 4 is a flow chart illustrating operations of an ITD synthesizer according to some embodiments of the present disclosure;

[0017] Figure 5 is a block diagram of an ITD synthesizer according to some embodiments of the present disclosure;

[0018] Figure 6 is a flow chart illustrating operations of the ITD synthesizer according to some embodiments of the present disclosure;

[0019] Figure 7 is an illustration of buffer operations the ITD synthesizer performs according to some embodiments of the present disclosure;

[0020] Figure 8 is a flow chart illustrating operations of an ITD synthesizer according to some embodiments of the present disclosure;

[0021] Figures 9-11 are illustrations of buffer operations the ITD synthesizer performs according to some embodiments of the present disclosure;

[0022] Figure 12 is an illustration of a sinc resampling function to handle compressing and extending segments of the signal;

[0023] Figure 13 is a block diagram of an audio object renderer in accordance with some embodiments;

[0024] Figure 14 is a block diagram of a host computer communicating with an encoder and/or a decoder in accordance with some embodiments; and

[0025] Figure 15 is a block diagram of a virtualization environment in accordance with some embodiments.

DETAILED DESCRIPTION

[0026] Some of the embodiments contemplated herein will now be described more fully with reference to the accompanying drawings, in which examples of embodiments of inventive concepts are shown. Embodiments are provided by way of example to convey the scope of the subject matter to those skilled in the art. Inventive concepts may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of present inventive concepts to those skilled in the art. It should also be noted that these embodiments are not mutually exclusive. Components from one embodiment may be tacitly assumed to be present/used in another embodiment.

[0027] Figure 1 illustrates an example of an operating environment in which the various embodiments of the present disclosure may be implemented. Turning to Figure 1, in the example operating environment 100, the encoder 102 receives data to be encoded, such as an audio file, from an entity such as a host 106 through network 104, and/or from storage 108. In some embodiments, the host 106 may communicate directly with the encoder 102. The encoder 102 encodes the audio file as described herein and either stores the encoded audio file in storage 108 or transmits the encoded audio file to a decoder 112 having an audio object renderer 114 via network 110. The audio object renderer 114 within the decoder 112 renders the decoded audio file and transmits the rendered decoded audio file to an audio player 116 for playback. For example, the audio player 116 may play the rendered decoded audio file for a spatial audio representation such as a Virtual Reality conference or computer game.
The audio player 116 may be or be comprised in a user equipment, a terminal, a mobile phone, and the like. In other embodiments, the host 106 may transmit encoded audio files to the audio object renderer 114 via network 110. In some embodiments, the audio object renderer 114 may be a stand-alone device between the decoder 112 and the audio player 116 in scenarios where the decoded audio does not directly "fit" the audio player 116, for example, to render the decoded audio file for headphones.

[0028] As previously indicated, when the sign of the time delay parameter changes, a shift operation needs to be performed on both output channels. To limit the complexity of the shift operation, the total transition length is shared between the channels. The sharing is done proportionally to the size of the shift on each side of the zero point following the formula:

t1 = T_tot · |ITD(m − 1)| / (|ITD(m − 1)| + |ITD(m)|)
t2 = T_tot · |ITD(m)| / (|ITD(m − 1)| + |ITD(m)|)

where ITD(m) and ITD(m − 1) are inter-channel time differences, m is a subframe index, T_tot is a total transition length and t1 and t2 are the transition lengths used to perform the time stretching or compressing operations. The time stretching or compressing operation may also be referred to as a time shifting operation or a resampling operation. The transition lengths may be expressed in seconds, or in a number of samples for a discretely sampled audio signal. The transition length may also be referred to as a transition time.

[0029] The above formula may be simplified (with integer rounding) to:

t1 = [T_tot · |ITD(m − 1)| / (|ITD(m − 1)| + |ITD(m)|)]
t2 = T_tot − t1

where [] represents rounding to the nearest integer.

[0030] In some embodiments of the present disclosure, the method operates in an ITD synthesizer that is implemented within an audio object renderer 114, as illustrated in Figure 2. In other embodiments, the ITD synthesizer may be implemented within a parametric stereo decoder. The method operates on segments of audio called frames, where each frame or subframe m consists of N samples: x(m, n), n = 0, 1, 2, …, N − 1.

[0031] The frames may for instance constitute segments of audio from a decoded audio object, a mono downmix channel in a parametric stereo decoder or an input channel to an audio object renderer. Here, the audio object renderer 114 receives the audio object comprising an audio signal and position metadata describing the position of the audio object. The position may be absolute or relative to a listener position. The position metadata is input to the HR filter module 210, which provides an ITD value and a set of HR filters for the left and right channels. The time delay parameter for frame m is an integer in the range ITD(m) ∈ [−ITD_max, ITD_max].

[0032] In case the frames come from a parametric stereo decoder, the ITD(m) can be found by analyzing the input channels to a stereo encoder. Preferably, the input channels are aligned by compensating for the ITD(m) before producing a down-mix channel. The down-mix channel would be encoded together with the stereo parameters including ITD(m), to be decoded and reconstructed in a parametric stereo decoder. The parametric stereo decoder would reconstruct the down-mix signal and the stereo parameters, including at least a reconstruction of ITD(m), and synthesize two output channels with the corresponding ITD(m).

[0033] The audio and position metadata may, e.g., come from an audio object decoder or be generated by a 3D audio engine for a spatial audio representation such as a Virtual Reality conference or computer game. The HR filter module 210 may, e.g., be a database of stored filters and ITD values, or it may be a model-based database producing filters and ITD values for the given position data.
The ITD value is input to the ITD synthesizer 220, which produces two output signals based on the input audio frame, where the output signals have the desired ITD. The two channels are filtered through the left and right filters 230 and 240 to produce the synthesized left and right channels. The audio object may be added together with one or more additional objects. The output left and right channels may be forwarded to an audio device for playback.

[0034] The flowchart of Figure 4 illustrates the operations of method 400 that the ITD synthesizer 220 performs in some embodiments. In block 401, the ITD synthesizer 220 receives a current ITD and an audio frame, wherein each frame m comprises N samples. Upon reception of the audio frame, or at any time during the processing of the frame, the ITD synthesizer 220 may store, in block 403, at least a part of the current input audio frame to be used for processing ITD in a following audio frame.

[0035] In block 405, the ITD synthesizer 220 determines transition times t1, t2 to perform a time shift to apply to at least one of an output signal 0 and an output signal 1 based on the signs of an inter-channel time difference, ITD, of the current audio frame and an ITD of a previous audio frame. The ITD comes from the HR filter module 210 in the context of Figure 2. In other embodiments where the ITD synthesizer is implemented as part of a parametric stereo decoder, the ITD is reconstructed from a bitstream. Here, the time shift denotes the operation to smoothly transition to a target ITD.

[0036] In block 407, the ITD synthesizer 220 applies the time shift within the determined transition times t1, t2 in generation of the output signal 0 and the output signal 1.

[0037] Prior to describing further detail of the ITD synthesizer 220, Figure 3 illustrates an embodiment where the ITD synthesizer may be implemented within a parametric stereo decoder 300. In the parametric stereo decoder 300, stereo parameters including ITD parameters are decoded by parameter decoder 310, and optionally a reconstructed residual signal is produced by a residual decoder 320. The down-mix decoder 330 is configured to decode and reconstruct an encoded down-mix signal to output reconstructed down-mix signals where the time shift can be applied. The reconstructed down-mix, the reconstructed stereo parameters and optionally a reconstructed residual signal are fed to a stereo up-mixer 340 to produce the reconstructed stereo signal. The ITD synthesizer 220 is part of the stereo up-mixer 340.

[0038] The ITD synthesizer 220 is described in further detail in Figure 5, and also performs the operations illustrated in Figure 6. Turning to Figure 6, in step 601, the processing buffer 510 is populated using the current input audio frame x(m, n) and the signal memory 520. The length of the memory N_mem should be at least the sum of the maximum time shift ITD_max and the lookback/lookahead memory d_LA needed for the resampling function.

[0039] The processing buffer 510 is illustrated in Figure 7, where the middle plot shows the processing buffer, in which index n = 0 corresponds to the first value of the current frame. The input frame is also fed to the signal memory 520 to be used in the next frame. In step 603, the ITD value of the current frame, ITD(m), is input to the transition length calculator 530, together with ITD(m − 1) from the ITD memory 540. The transition lengths are computed based on ITD(m) and ITD(m − 1).
First the total transition length is computed based on the frame length N and the lookahead d_LA needed by the resamplers 570, 580. If |ITD(m)| < d_LA, a small part of the processing buffer must be kept to accommodate the lookahead without introducing processing delay due to the resampling. The transition may be divided into three parts, t1, t2 and t3. The transition length t3 may be seen as a buffer length to avoid reading out of memory in the resampling operation. The resampler must leave at least d_LA samples at the end of the buffer to perform the resampling or filtering operation. Transition length t3 is computed according to:

t3 = max(0, d_LA − |ITD(m)|)

[0040] The total transition time T_tot is then:

T_tot = T_max − t3

where T_max is the maximum allowed transition length. It may be set to T_max = N, meaning that the full frame time is permitted for performing the transition. However, if N is large it may be desirable to limit the maximum allowed transition length using T_max ≤ N to achieve a faster transition and possibly lower complexity.

[0041] Figure 8 illustrates the operations the ITD synthesizer 220 performs in determining the total transition length. In block 801, the ITD synthesizer 220 computes a total transition length based on a frame length N of the frame m and a lookahead memory d_LA required for resampling, the total transition being divided into two parts comprising t1 and t2.

[0042] In block 803, the ITD synthesizer 220 determines a transition time t3 based on the lookahead memory d_LA. In block 805, the ITD synthesizer 220 determines a total transition length based on a maximum allowed transition length T_max and the transition time t3.

[0043] The following time shift operations can be divided into two groups:

1. The sign of ITD(m) and ITD(m − 1) is the same, or one of them is zero.
2. The sign of the ITD is non-zero and changing, i.e., ITD(m) ∙ ITD(m − 1) < 0.

[0044] Case 1 – Sign of ITD is the same or one of them is zero

[0045] If the sign of ITD(m) and ITD(m − 1) is the same, or if one of them is zero, the shift may be handled by processing just one of the channels, meaning processing step 605 where the resampler 570 adjusts the processing buffer 510 to populate the output buffer A 550. This can be realized by assigning the full transition length to t1 and setting t2 to zero, i.e.:

t1 = T_tot
t2 = T_tot − t1 = 0
s1 = −|ITD(m − 1)|, L_rs,1 = t1 + |ITD(m − 1)| − |ITD(m)|
s2 = N − |ITD(m)|, L_rs,2 = 0

where s1, s2 denote the starting indices of each time shift segment, assuming that the current input subframe starts at n = 0, and L_rs,1, L_rs,2 are the lengths of the resampling segments 1 and 2. In this case the input signal part of the processing buffer is simply copied to output buffer B 560.

[0046] When the time delay of the current frame is the same as the previous frame, i.e., ITD(m) = ITD(m − 1), the output time delay synthesis is produced by pointing to the corresponding starting point in the processing buffer. The sign of ITD(m) determines in which of the two channels to apply the delay. For instance, a positive ITD(m) could indicate that the left channel of a stereo pair is ahead of the right channel, in which case the right channel should be delayed and the left channel be output without delay. This situation is illustrated in Figure 7. In this case, the resampling operation on output buffer A 550 has the same input and output length and is equivalent to a copy operation.
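By way of illustration only, the transition-length logic of paragraphs [0029] and [0039]-[0043] may be sketched in Python as follows; the function name transition_lengths and the representation of ITD values as signed sample counts are assumptions made for this sketch, not part of the disclosure:

    def transition_lengths(itd_cur, itd_prev, t_max, d_la):
        """Sketch: compute (t1, t2, t3) in samples for the current frame."""
        # t3 reserves the lookahead d_LA needed by the resampler:
        # t3 = max(0, d_LA - |ITD(m)|).
        t3 = max(0, d_la - abs(itd_cur))
        # Total transition length available for the shift: T_tot = T_max - t3.
        t_tot = t_max - t3
        if itd_cur * itd_prev >= 0:
            # Case 1: same sign, or one ITD is zero; shift one channel only.
            t1, t2 = t_tot, 0
        else:
            # Case 2: sign change; split proportionally to the shift on each
            # side of the zero point, rounding to the nearest integer.
            t1 = int(t_tot * abs(itd_prev) / (abs(itd_prev) + abs(itd_cur)) + 0.5)
            t2 = t_tot - t1
        return t1, t2, t3

For example, with t_max = 960, d_la = 32 and the ITD moving from −12 to +36 samples, the sketch returns t1 = 240, t2 = 720 and t3 = 0, so the larger share of the transition budget is spent on the channel with the larger shift.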
[0047] When the absolute value of the time delay of the current frame is larger than the absolute value of the previous frame, |ITD(m)| > |ITD(m − 1)|, a transition is generated to allow a smooth transition between the delay values. This situation is illustrated in Figure 9. The transition is done by extending the frame x_proc(n), n = −|ITD(m − 1)|, …, t1 − 1 − |ITD(m)|, of length t1 + |ITD(m − 1)| − |ITD(m)|, to an output frame of length t1. If the transition time t3 is larger than zero (i.e., t3 > 0), the last t3 samples of the output channel are simply copied from the processing buffer to arrive at a delay of |ITD(m)|: x_proc(n), n = N − 1 − t3 − |ITD(m)|, …, N − 1 − |ITD(m)|. Here, one may also note that if |ITD(m − 1)| = |ITD(m)|, the input length is the same as the output length and the resampling would be equivalent to a copy operation.

[0048] When the absolute value of the time delay decreases, i.e., |ITD(m)| < |ITD(m − 1)|, the expression for the input frame length remains the same. However, the length t1 + |ITD(m − 1)| − |ITD(m)| will now be larger than the resulting length t1 and the resampling corresponds to shortening the length of the frame. This is illustrated in Figure 10. In this example, ITD(m) = 0, which means t3 = d_LA, and the last t3 samples of the output buffer A are copied from the processing buffer 510.

[0049] Case 2 – Sign of ITD is different and non-zero

[0050] If the sign of the ITD is changing, i.e., if ITD(m) ∙ ITD(m − 1) < 0, a shift operation must be done on both channels. In this case the total transition length T_tot is split into two parts according to:

t1 = [T_tot · |ITD(m − 1)| / (|ITD(m − 1)| + |ITD(m)|)]
t2 = T_tot − t1

Output buffers A and B are assembled by time-shifting the signal using the transition times t1 and t2 as illustrated in Figure 11. First, a resampling is done of the buffer starting from s1 of length L_rs,1 to the first t1 samples of output buffer A. The last N − t1 samples are populated by copying the remainder of the processing buffer to output buffer A. Then, the first samples of the processing buffer starting from index 0 are copied to the first t1 samples of output buffer B. The next t2 samples of output buffer B are created by resampling the samples starting from s2 in the processing buffer of length L_rs,2. Finally, the last t3 samples of the processing buffer are copied to output buffer B. The last N_mem samples of the input frame are stored in memory for processing the next subframe. Output buffers A and B are assigned to the output channels 0 and 1 for left and right HRIR filtering respectively. The assignment of the output channels is done based on the signs of ITD(m) and ITD(m − 1) as described below, where ∧ denotes logical AND and ∨ denotes inclusive OR. The resampling operations are implemented using a polyphase filter with a sinc function from a lookup table. A benefit of dividing the transition length between the two channels is that the computational complexity of the resampler is proportional to the length of the transition, and by constraining the total transition length to T_tot the total complexity is kept below a certain limit. Further, the transition speed on the left and right channels is kept roughly the same (roughly, since integer rounding takes place). Since the artefacts from the transition are lower for lower transition speed, this keeps the transition artefacts at a minimum.

[0051] The resampling and copying operations may also be described referring to the indices of the buffers as follows.
When the ITD signs are different, the first transition is to move from ITD(m − 1) to 0 on output buffer A 550 and then shift from 0 to ITD(m) on output buffer B 560. An example of this process is illustrated in Figure 11. In step 605, the resampler 570 fills the first t1 samples of output buffer A 550. This is done by resampling the samples n = −|ITD(m − 1)|, …, t1 − 1 of length t1 + |ITD(m − 1)| to fit into the samples n = 0, …, t1 − 1 of output buffer A 550. In the same step, the samples x_proc(n), n = 0, …, t1 − 1 are copied to the corresponding indices n = 0, …, t1 − 1 in output buffer B 560. In step 607, the resampler 580 adapts the samples x_proc(n), n = t1, …, t1 + t2 − 1 − |ITD(m)| of length t2 − |ITD(m)| to fit into the samples n = t1, …, t1 + t2 − 1 of the length t2 in output buffer B 560. In case t3 > 0, the last t3 samples n = N − 1 − t3 − |ITD(m)|, …, N − 1 − |ITD(m)| are copied from the processing buffer 510 to output buffer B 560 in step 609. It should be noted that the output buffers 550 and 560 may have a different alignment such that the indices are shifted relative to the processing buffer. In that case the above indices related to output buffers 550 and 560 would be shifted by that amount, but the segments would still be appended in the same way as described here. For instance, output buffer A 550 in Figure 7 and Figure 9 may be offset by −|ITD(m)| to be aligned with the processing buffer indices. Further, the resampler 570 and resampler 580 may be realized using the same resampling function operating on different input.

[0052] In step 611, common for Case 1 and Case 2 above, the output buffer A 550 and the output buffer B 560 are assigned to output 0 and output 1. In the intermediate buffers A and B, A always corresponds to the channel that currently has a non-zero ITD and is delayed, while buffer B corresponds to the channel that has zero ITD. The intermediate buffers simplify the processing using these assumptions, and the output assignment is a simple step which may be done at the end to assign the processed buffers to the correct output channel. The assignment of the output buffers depends on the signs of ITD(m − 1) and ITD(m) following this pseudo-code:

IF ITD(m − 1) = 0
    IF ITD(m) > 0
        Output buffer A 550 → output 1, Output buffer B 560 → output 0
    ELSE
        Output buffer A 550 → output 0, Output buffer B 560 → output 1
ELSE
    IF ITD(m − 1) > 0
        Output buffer A 550 → output 1, Output buffer B 560 → output 0
    ELSE
        Output buffer A 550 → output 0, Output buffer B 560 → output 1

It may also be simplified into:

(A, B) → (1, 0) if (ITD(m − 1) = 0 ∧ ITD(m) > 0) ∨ ITD(m − 1) > 0
(A, B) → (0, 1) otherwise

where ∧ denotes logical AND and ∨ denotes inclusive OR, (A, B) → (1, 0) means assigning Output buffer A 550 to output 1 and Output buffer B 560 to output 0, and (A, B) → (0, 1) means assigning Output buffer A 550 to output 0 and Output buffer B 560 to output 1. The output buffers 0 and 1 may correspond to binaural channels left and right respectively. The numbering may also be done differently, e.g., output buffers 1 and 2.
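A minimal Python sketch of this assignment step, mirroring the simplified boolean form above (the function and variable names are illustrative assumptions, not part of the disclosure):

    def assign_outputs(buf_a, buf_b, itd_cur, itd_prev):
        """Sketch: map intermediate buffers A/B to output channels 0 and 1."""
        # (A, B) -> (1, 0) if (ITD(m-1) = 0 AND ITD(m) > 0) OR ITD(m-1) > 0.
        if (itd_prev == 0 and itd_cur > 0) or itd_prev > 0:
            return buf_b, buf_a  # output 0 <- buffer B, output 1 <- buffer A
        # (A, B) -> (0, 1) otherwise.
        return buf_a, buf_b      # output 0 <- buffer A, output 1 <- buffer B

Because buffer A always holds the delayed channel, the single boolean test suffices to route both buffers.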
[0053] In other words, if ITD(m − 1) is zero, the current ITD, ITD(m), is used to determine which buffer to delay. If ITD(m) is positive, output 0 is ahead of output 1 and output 1 should be delayed. If ITD(m) is negative, output 1 is ahead of output 0 and output 0 should be delayed. If ITD(m − 1) is not zero, the previous ITD, ITD(m − 1), decides which buffer to shift first. If ITD(m − 1) is positive, output 1 is shifted first in buffer A 550, followed by a shift of output 0 in buffer B 560. If ITD(m − 1) is negative, output 0 is shifted first in buffer A 550, followed by a shift of output 1 in buffer B 560. It should be noted that the definition of the sign of ITD(m) may be reversed, in which case output 0 and output 1 would switch places above.

[0054] It should be noted that step 611 may happen before step 605 by assigning the output buffer A 550 and output buffer B 560 already to the designated outputs, such that the output is populated during steps 605-611. In an embodiment, output 0 and output 1 may correspond to the left and right channel respectively.

[0055] Resampling with sinc function

[0056] The described method relies on a resampling function to handle compressing and extending segments of the signal. This may be realized using a sinc resampling function, as illustrated in Figure 12. Given an input signal x(n) of length L_in and an output length L_out, the input signal may be resampled at the fractional indices n = 0, 1, 2, …, L_out − 1 as follows. For each n, calculate the fractional read position p = n · L_in / L_out, its integer part n_int = ⌊p⌋ and its fractional part Δ = p − n_int, where ⌊⌋ represents a round-down operation.

[0057] Since the sinc function is computationally complex to compute, it may be desirable to store it in a table with a predefined resolution. For instance, a resolution of R_sinc = 64, meaning there are 64 samples between the zero crossings of the sinc function, may be suitable (see Figure 12). The corresponding index offset in the sinc table could then be found at Δn_tab = ⌊Δ · R_sinc + 0.5⌋.

[0058] The output values y(n) may then be found by the sum over the filter taps k:

y(n) = Σ_k x(n_int + k) · sinc_tab(k · R_sinc − Δn_tab)

where sinc_tab is the sinc function tabulated at resolution R_sinc and Δn_tab = ⌊Δ · R_sinc + 0.5⌋.
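As a non-normative illustration of paragraphs [0055]-[0058], the table-based sinc resampling may be sketched in Python as follows; the one-sided filter length K, the plain (unwindowed) sinc table and the truncation at segment edges are assumptions of this sketch, whereas an implementation may instead use a windowed sinc in a polyphase structure with d_LA lookback/lookahead samples, as described above:

    import numpy as np

    R_SINC = 64  # table resolution: samples between sinc zero crossings
    K = 4        # one-sided filter length in input samples (assumption)

    # Sinc table covering the offsets k*R_SINC - dn_tab used below.
    _IDX = np.arange(-(K + 1) * R_SINC, K * R_SINC + 1)
    SINC_TAB = np.sinc(_IDX / R_SINC)
    _OFF = (K + 1) * R_SINC  # position of offset 0 in the table

    def sinc_resample(x, l_out):
        """Sketch: resample input segment x (length L_in) to l_out samples."""
        l_in = len(x)
        y = np.zeros(l_out)
        for n in range(l_out):
            pos = n * l_in / l_out             # fractional read position p
            n_int = int(np.floor(pos))         # integer part (round-down)
            frac = pos - n_int                 # fractional part delta
            dn_tab = int(frac * R_SINC + 0.5)  # nearest table offset
            for k in range(-K, K + 1):
                i = n_int + k
                if 0 <= i < l_in:  # truncate at segment edges (assumption)
                    # SINC_TAB entry approximates sinc(k - delta).
                    y[n] += x[i] * SINC_TAB[k * R_SINC - dn_tab + _OFF]
        return y

Where the description above extends or compresses a segment, l_out would be set to the target segment length (e.g., t1 for the first transition on output buffer A).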
[0059] Note that while the above embodiments were described using an audio object renderer (e.g., a decoder), the various embodiments described above could also be performed at an encoder, where the outputs (i.e., output 0 and output 1) are shifted at the encoder instead of at the audio object renderer.

[0060] Figure 13 shows an audio object renderer 114 (e.g., a decoder) in accordance with some embodiments where the audio object renderer 114 is implemented as a stand-alone device. As used herein, an audio object renderer refers to a device capable, configured, arranged and/or operable to decode encoded objects and communicate with network nodes, encoders, and/or decoders. Examples of an audio object renderer include, but are not limited to, a smart phone, mobile phone, cell phone, voice over IP (VoIP) phone, wireless local loop phone, desktop computer, personal digital assistant (PDA), wireless camera, gaming console or device, storage device, playback appliance, wearable terminal device, wireless endpoint, mobile station, tablet, laptop, laptop-embedded equipment (LEE), laptop-mounted equipment (LME), smart device, wireless customer-premise equipment (CPE), vehicle-mounted or vehicle embedded/integrated wireless device, etc.

[0061] An audio object renderer may support device-to-device (D2D) communication, for example by implementing a 3GPP standard for sidelink communication, Dedicated Short-Range Communication (DSRC), vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I), or vehicle-to-everything (V2X). In other examples, a decoder may not necessarily have a user in the sense of a human user who owns and/or operates the relevant device.

[0062] The audio object renderer 114 includes processing circuitry 1302 that is operatively coupled via a bus 1304 to an input/output interface 1306, a power source 1308, a memory 1310, a communication interface 1312, and/or any other component, or any combination thereof. Certain decoders may utilize all or a subset of the components shown in Figure 13. The level of integration between the components may vary from one decoder to another decoder. Further, certain decoders may contain multiple instances of a component, such as multiple processors, memories, transceivers, transmitters, receivers, etc.

[0063] The processing circuitry 1302 is configured to process instructions and data and may be configured to implement any sequential state machine operative to execute instructions stored as machine-readable computer programs in the memory 1310. The processing circuitry 1302 may be implemented as one or more hardware-implemented state machines (e.g., in discrete logic, field-programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), etc.); programmable logic together with appropriate firmware; one or more stored computer programs, general-purpose processors, such as a microprocessor or digital signal processor (DSP), together with appropriate software; or any combination of the above. For example, the processing circuitry 1302 may include multiple central processing units (CPUs).

[0064] In the example, the input/output interface 1306 may be configured to provide an interface or interfaces to an input device, output device, or one or more input and/or output devices. Examples of an output device include a speaker, a sound card, a video card, a display, a monitor, an actuator, an emitter, a smartcard, another output device, or any combination thereof. An input device may allow a user to capture information into the audio object renderer 114. Examples of an input device include a touch-sensitive or presence-sensitive display, a camera (e.g., a digital camera, a digital video camera, a web camera, etc.), a microphone, a sensor, a directional pad, a trackpad, a scroll wheel, a smartcard, and the like. The presence-sensitive display may include a capacitive or resistive touch sensor to sense input from a user. A sensor may be, for instance, an accelerometer, a gyroscope, a tilt sensor, a force sensor, a magnetometer, an optical sensor, a proximity sensor, a biometric sensor, etc., or any combination thereof. An output device may use the same type of interface port as an input device. For example, a Universal Serial Bus (USB) port may be used to provide an input device and an output device.

[0065] In some embodiments, the power source 1308 is structured as a battery or battery pack. Other types of power sources, such as an external power source (e.g., an electricity outlet), photovoltaic device, or power cell, may be used. The power source 1308 may further include power circuitry for delivering power from the power source 1308 itself, and/or an external power source, to the various parts of the audio object renderer 114 via input circuitry or an interface such as an electrical power cable. Delivering power may be, for example, for charging of the power source 1308. Power circuitry may perform any formatting, converting, or other modification to the power from the power source 1308 to make the power suitable for the respective components of the audio object renderer 114 to which power is supplied.
[0066] The memory 1310 may be or be configured to include memory such as random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic disks, optical disks, hard disks, removable cartridges, flash drives, and so forth. In one example, the memory 1310 includes one or more application programs 1314, such as an operating system, web browser application, a widget, gadget engine, or other application, and corresponding data 1316. The memory 1310 may store, for use by the audio object renderer 114, any of a variety of operating systems or combinations of operating systems.

[0067] The memory 1310 may be configured to include a number of physical drive units, such as redundant array of independent disks (RAID), flash memory, USB flash drive, external hard disk drive, thumb drive, pen drive, key drive, high-density digital versatile disc (HD-DVD) optical disc drive, internal hard disk drive, Blu-Ray optical disc drive, holographic digital data storage (HDDS) optical disc drive, external mini-dual in-line memory module (DIMM), synchronous dynamic random access memory (SDRAM), external micro-DIMM SDRAM, smartcard memory such as a tamper resistant module in the form of a universal integrated circuit card (UICC) including one or more subscriber identity modules (SIMs), such as a USIM and/or ISIM, other memory, or any combination thereof. The UICC may for example be an embedded UICC (eUICC), integrated UICC (iUICC) or a removable UICC commonly known as a 'SIM card.' The memory 1310 may allow the audio object renderer 114 to access instructions, application programs and the like, stored on transitory or non-transitory memory media, to off-load data, or to upload data. An article of manufacture, such as one utilizing a communication system, may be tangibly embodied as or in the memory 1310, which may be or comprise a device-readable storage medium.

[0068] The processing circuitry 1302 may be configured to communicate with an access network or other network using the communication interface 1312. The communication interface 1312 may comprise one or more communication subsystems and may include or be communicatively coupled to an antenna 1322. The communication interface 1312 may include one or more transceivers used to communicate, such as by communicating with one or more remote transceivers of another device capable of wireless communication (e.g., another UE or a network node in an access network). Each transceiver may include a transmitter 1318 and/or a receiver 1320 appropriate to provide network communications (e.g., optical, electrical, frequency allocations, and so forth). Moreover, the transmitter 1318 and receiver 1320 may be coupled to one or more antennas (e.g., antenna 1322) and may share circuit components, software or firmware, or alternatively be implemented separately.

[0069] In the illustrated embodiment, communication functions of the communication interface 1312 may include cellular communication, Wi-Fi communication, LPWAN communication, data communication, voice communication, multimedia communication, short-range communications such as Bluetooth, near-field communication, location-based communication such as the use of the global positioning system (GPS) to determine a location, another like communication function, or any combination thereof.
Communications may be implemented in accordance with one or more communication protocols and/or standards, such as IEEE 802.11, Code Division Multiplexing Access (CDMA), Wideband Code Division Multiple Access (WCDMA), GSM, LTE, New Radio (NR), UMTS, WiMax, Ethernet, transmission control protocol/internet protocol (TCP/IP), synchronous optical networking (SONET), Asynchronous Transfer Mode (ATM), QUIC, Hypertext Transfer Protocol (HTTP), and so forth.

[0070] Regardless of the type of sensor, an audio object renderer may provide an output of decoded data, through its communication interface 1312, via a wireless connection to a network node.

[0071] An audio object renderer, when in the form of an Internet of Things (IoT) device, may be a device for use in one or more application domains, these domains comprising, but not limited to, city wearable technology, extended industrial application and healthcare. Non-limiting examples of such an IoT device are a device which is or which is embedded in: a connected refrigerator or freezer, a TV, a connected lighting device, an electricity meter, a robot vacuum cleaner, a voice controlled smart speaker, a home security camera, a thermostat, an electrical door lock, a connected doorbell, an autonomous vehicle, a surveillance system, a weather monitoring device, a vehicle parking monitoring device, an electric vehicle charging station, a smart watch, a fitness tracker, a head-mounted display for Augmented Reality (AR) or Virtual Reality (VR), or a wearable for tactile augmentation or sensory enhancement. A decoder in the form of an IoT device comprises circuitry and/or software in dependence on the intended application of the IoT device, in addition to other components as described in relation to the audio object renderer 114 shown in Figure 13.

[0072] Figure 14 is a block diagram of a host 1400 in accordance with various aspects described herein. As used herein, the host 1400 may be or comprise various combinations of hardware and/or software, including a standalone server, a blade server, a cloud-implemented server, a distributed server, a virtual machine, container, or processing resources in a server farm. The host 1400 may provide one or more services to one or more UEs.

[0073] The host 1400 includes processing circuitry 1402 that is operatively coupled via a bus 1404 to an input/output interface 1406, a network interface 1408, a power source 1410, and a memory 1412. Other components may be included in other embodiments. Features of these components may be substantially similar to those described with respect to the devices of previous figures, such as Figure 13, such that the descriptions thereof are generally applicable to the corresponding components of host 1400.

[0074] The memory 1412 may include one or more computer programs including one or more host application programs 1414 and data 1416, which may include user data, e.g., data generated by a UE for the host 1400 or data generated by the host 1400 for a UE. Embodiments of the host 1400 may utilize only a subset or all of the components shown.
The host application programs 1414 may be implemented in a container-based architecture and may provide support for video codecs (e.g., Versatile Video Coding (VVC), High Efficiency Video Coding (HEVC), Advanced Video Coding (AVC), MPEG, VP9) and audio codecs (e.g., EVS, IVAS, FLAC, Advanced Audio Coding (AAC), MPEG, G.711), including transcoding for multiple different classes, types, or implementations of UEs (e.g., handsets, desktop computers, wearable display systems, heads-up display systems). The host application programs 1414 may also provide for user authentication and licensing checks and may periodically report health, routes, and content availability to a central node, such as a device in or on the edge of a core network. Accordingly, the host 1400 may select and/or indicate a different host for over-the-top services for a UE. The host application programs 1414 may support various protocols, such as the HTTP Live Streaming (HLS) protocol, Real-Time Messaging Protocol (RTMP), Real-Time Streaming Protocol (RTSP), Dynamic Adaptive Streaming over HTTP (MPEG-DASH), etc.

[0075] Figure 15 is a block diagram illustrating a virtualization environment 1500 in which functions implemented by some embodiments of the audio object renderer 114 or components of the audio object renderer 114 may be virtualized. In the present context, virtualizing means creating virtual versions of apparatuses or devices, which may include virtualizing hardware platforms, storage devices and networking resources. As used herein, virtualization can be applied to any device described herein, or components thereof, and relates to an implementation in which at least a portion of the functionality is implemented as one or more virtual components. Some or all of the functions described herein may be implemented as virtual components executed by one or more virtual machines (VMs) implemented in one or more virtual environments 1500 hosted by one or more hardware nodes, such as a hardware computing device that operates as a decoder, encoder, network node, UE, core network node, or host. Further, in embodiments in which the virtual node does not require radio connectivity (e.g., a core network node or host), the node may be entirely virtualized.

[0076] Applications 1502 (which may alternatively be called software instances, virtual appliances, network functions, virtual nodes, virtual network functions, etc.) are run in the virtualization environment 1500 to implement some of the features, functions, and/or benefits of some of the embodiments disclosed herein.

[0077] Hardware 1504 includes processing circuitry, memory that stores software and/or instructions executable by the hardware processing circuitry, and/or other hardware devices as described herein, such as a network interface, input/output interface, and so forth. Software may be executed by the processing circuitry to instantiate one or more virtualization layers 1506 (also referred to as hypervisors or virtual machine monitors (VMMs)), provide VMs 1508A and 1508B (one or more of which may be generally referred to as VMs 1508), and/or perform any of the functions, features and/or benefits described in relation with some embodiments described herein. The virtualization layer 1506 may present a virtual operating platform that appears like networking hardware to the VMs 1508.

[0078] The VMs 1508 comprise virtual processing, virtual memory, virtual networking or interface and virtual storage, and may be run by a corresponding virtualization layer 1506.
Different embodiments of the instance of a virtual appliance 1502 may be implemented on one or more of VMs 1508, and the implementations may be made in different ways. Virtualization of the hardware is in some contexts referred to as network function virtualization (NFV). NFV may be used to consolidate many network equipment types onto industry-standard high-volume server hardware, physical switches, and physical storage, which can be located in data centers and customer premises equipment.

[0079] In the context of NFV, a VM 1508 may be a software implementation of a physical machine that runs programs as if they were executing on a physical, non-virtualized machine. Each of the VMs 1508, and that part of hardware 1504 that executes that VM, be it hardware dedicated to that VM and/or hardware shared by that VM with others of the VMs, forms a separate virtual network element. Still in the context of NFV, a virtual network function is responsible for handling specific network functions that run in one or more VMs 1508 on top of the hardware 1504 and corresponds to the application 1502.

[0080] Hardware 1504 may be implemented in a standalone network node with generic or specific components. Hardware 1504 may implement some functions via virtualization. Alternatively, hardware 1504 may be part of a larger cluster of hardware (e.g., in a data center or customer premises equipment) where many hardware nodes work together and are managed via management and orchestration 1510, which, among other functions, oversees lifecycle management of applications 1502. In some embodiments, hardware 1504 is coupled to one or more radio units that each include one or more transmitters and one or more receivers that may be coupled to one or more antennas. Radio units may communicate directly with other hardware nodes via one or more appropriate network interfaces and may be used in combination with the virtual components to provide a virtual node with radio capabilities, such as a radio access node or a base station. In some embodiments, some signaling can be provided with the use of a control system 1512, which may alternatively be used for communication between hardware nodes and radio units.

[0081] Although the computing devices described herein (e.g., decoders, audio object renderers, encoders, hosts) may include the illustrated combination of hardware components, other embodiments may comprise computing devices with different combinations of components. It is to be understood that these computing devices may comprise any suitable combination of hardware and/or software needed to perform the tasks, features, functions and methods disclosed herein. Determining, calculating, obtaining or similar operations described herein may be performed by processing circuitry, which may process information by, for example, converting the obtained information into other information, comparing the obtained information or converted information to information stored in the network node, and/or performing one or more operations based on the obtained information or converted information, and as a result of said processing making a determination. Moreover, while components are depicted as single boxes located within a larger box, or nested within multiple boxes, in practice, computing devices may comprise multiple different physical components that make up a single illustrated component, and functionality may be partitioned between separate components.
For example, a communication interface may be configured to include any of the components described herein, and/or the functionality of the components may be partitioned between the processing circuitry and the communication interface. In another example, non-computationally intensive functions of any of such components may be implemented in software or firmware, and computationally intensive functions may be implemented in hardware.

[0082] In certain embodiments, some or all of the functionality described herein may be provided by processing circuitry executing instructions stored in memory, which in certain embodiments may be a computer program product in the form of a non-transitory computer-readable storage medium. In alternative embodiments, some or all of the functionality may be provided by the processing circuitry without executing instructions stored on a separate or discrete device-readable storage medium, such as in a hard-wired manner. In any of those particular embodiments, whether executing instructions stored on a non-transitory computer-readable storage medium or not, the processing circuitry can be configured to perform the described functionality. The benefits provided by such functionality are not limited to the processing circuitry alone or to other components of the computing device, but are enjoyed by the computing device as a whole, and/or by end users and a wireless network generally.

[0083] Example embodiments

1. A method in an inter-channel time difference, ITD, synthesizer (220, 340, 1502) comprising: receiving (401) a current ITD and audio frames, wherein each frame m comprises N samples; storing (403) at least a part of the current input audio frame in signal memory; determining (405) transition times t1, t2 to perform a time shift to apply to at least one of an output signal 0 and an output signal 1 based on an inter-channel time difference, ITD, of the current input audio frame and an ITD of a previous input audio frame; and applying (407) the time shift within the determined transition times t1, t2 in generation of the output signal 0 and the output signal 1.

2. The method of Embodiment 1, wherein the audio frames are part of an object audio signal with position metadata describing a position relative to a listener, the method further comprising obtaining the time shift from the position metadata.

3. The method of any of Embodiments 1-2, further comprising: computing (801) a total transition length ttot based on a frame length N of the frame m and a lookahead memory dLA required for resampling, the total transition length being divided into two parts comprising t1 and t2; determining (803) a buffer length t3 based on the lookahead memory dLA; and determining (805) the total transition length based on a maximum allowed transition length tmax and the buffer length t3.

4. The method of Embodiment 3, wherein determining the buffer length t3 comprises determining the buffer length t3 according to: t3 = max(0, dLA − |ITD(m − 1)|), and determining the total transition length according to: ttot = tmax − t3, wherein tmax ≤ N.

5. The method of any of Embodiments 1-4, wherein determining the transition times t1, t2 comprises: responsive to the sign of the current ITD and the sign of the previous ITD being the same, assigning the total transition length to one of the transition times t1, t2 and setting the other one to zero.
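For illustration only, the following non-normative Python sketch shows one way the transition-length computation of Embodiments 3-5 could be realized. All names (itd_prev, d_la, t_max, n_frame) are illustrative stand-ins for ITD(m − 1), dLA, tmax and N, and the sketch simply evaluates the reconstructed expressions above; it is not the claimed implementation.

```python
# Minimal, non-normative sketch of Embodiments 3-5.
# Illustrative names: d_la = lookahead memory dLA (samples),
# t_max = maximum allowed transition length (t_max <= N),
# itd_prev = ITD(m - 1) in samples, n_frame = N.

def transition_lengths(itd_prev: int, d_la: int, t_max: int, n_frame: int):
    """Return (t3, ttot, t1, t2) for the same-sign case."""
    assert t_max <= n_frame
    # Buffer length t3: the resampling lookahead, less the part already
    # covered by the previous frame's shift (Embodiment 4).
    t3 = max(0, d_la - abs(itd_prev))
    # Total transition length available within the frame.
    t_tot = t_max - t3
    # Embodiment 5: same-sign ITDs shift only one channel, so the whole
    # transition length goes to one transition time, the other is zero.
    t1, t2 = t_tot, 0
    return t3, t_tot, t1, t2
```

For example, with an assumed 20 ms frame at 48 kHz (n_frame = 960), d_la = 15, t_max = 960 and itd_prev = 10, this gives t3 = 5 and ttot = 955; the specific numbers are assumptions chosen only to exercise the formulas.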
6. The method of any of Embodiments 1-4, wherein determining the transition times t1, t2 comprises: responsive to the sign of the current ITD and the sign of the previous ITD being different, applying a shift operation on both output signal 0 and output signal 1 by splitting the total transition length into two parts to determine the transition times t1, t2.

7. The method of Embodiment 6, wherein splitting the total transition length into two parts to determine the transition times t1, t2 comprises splitting the total transition length according to a split formula in which [ ] represents a rounding operation to the nearest integer.

8. The method of any of Embodiments 3-7, further comprising: populating a processing buffer (510) using a current input audio frame of the object audio and signal memory; wherein applying the transition times t1, t2 determined in generation of the output signal 0 and output signal 1 comprises: responsive to the sign of the current ITD and the sign of the previous ITD being the same, or responsive to one of the current ITD and the previous ITD being zero: adjusting the processing buffer (510) to populate a first output buffer (550) by assigning the total transition length to t1 and setting t2 to zero; and copying an input signal part of the processing buffer (510) to a second output buffer (560).

9. The method of Embodiment 8, wherein applying the transition times t1, t2 determined in generation of the output signal 0 and output signal 1 further comprises: responsive to ITD(m) = ITD(m − 1) and a sign of one of the current ITD and the previous ITD being negative, thereby indicating that one of the output signal 0 and output signal 1 is ahead of the other of the output signal 0 and the output signal 1, delaying an output buffer comprising whichever one of the first output buffer (550) or the second output buffer (560) is associated with the other of the output signal 0 and output signal 1 by the total transition length.

10. The method of any of Embodiments 8-9, wherein applying the transition times t1, t2 determined in generation of the output signal 0 and output signal 1 further comprises: responsive to |ITD(m)| > |ITD(m − 1)|, generating a transition by: extending the frame in the processing buffer (510) from length t1 + |ITD(m − 1)| − |ITD(m)| to an output frame of length t1; and responsive to the buffer length t3 being larger than zero, adding the last t3 samples of the output channel by copying from the processing buffer (510) the samples n = N − 1 − t3 − |ITD(m)|, …, N − 1 − |ITD(m)|.

11. The method of any of Embodiments 8-10, wherein applying the transition times t1, t2 determined in generation of the output signal 0 and output signal 1 further comprises: responsive to |ITD(m)| < |ITD(m − 1)|, adding the last t3 samples of the first output buffer (550) by copying from the processing buffer (510).
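For illustration only, the sign-change split of Embodiments 6-7 and the frame extension/compression of Embodiments 10-11 could be sketched as follows. The exact split formula is not reproduced in the text above, so the proportional split below is an assumption on the author's part, and np.interp merely stands in for whatever resampler an implementation actually uses; all names are illustrative.

```python
import numpy as np

def split_transition(t_tot: int, itd_curr: int, itd_prev: int):
    """Embodiments 6-7 sketch: divide ttot into t1, t2 on an ITD sign
    change so both output channels receive a shift. The proportional
    split is ASSUMED; the source only states that [ ] denotes rounding
    to the nearest integer."""
    assert itd_curr * itd_prev < 0  # sign-change case only
    t1 = round(t_tot * abs(itd_prev) / (abs(itd_prev) + abs(itd_curr)))
    return t1, t_tot - t1

def rescale_segment(segment: np.ndarray, dst_len: int) -> np.ndarray:
    """Embodiments 10-11 sketch: time-scale a transition segment, e.g.
    stretch a region of length t1 + |ITD(m - 1)| - |ITD(m)| to an output
    of length t1. Linear interpolation stands in for the actual
    resampler, which requires the lookahead memory dLA."""
    src_idx = np.arange(len(segment), dtype=float)
    dst_idx = np.linspace(0.0, len(segment) - 1, dst_len)
    return np.interp(dst_idx, src_idx, segment)
```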
12. The method of any of Embodiments 8-11, wherein applying the transition times t1, t2 determined in generation of the output signal 0 and output signal 1 further comprises: responsive to ITD(m) ∙ ITD(m − 1) < 0, splitting the total transition length according to a split formula in which [ ] represents a rounding operation to the nearest integer, where splitting the total transition length comprises: resampling samples n = 0, …, t1 + |ITD(m − 1)| − 1, of length t1 + |ITD(m − 1)|, to fit into samples n = 0, …, t1 − 1 of the first output buffer (550); copying samples n = 0, …, t1 − 1 to corresponding indices in the second output buffer (560); and adapting samples n = t1, …, t1 + t2 − 1 − |ITD(m)|, of length t2 − |ITD(m)|, to fit into the samples n = t1, …, t1 + t2 − 1 of length t2 in the second output buffer (560).

13. The method of any of Embodiments 8-12, wherein applying the transition times t1, t2 determined in generation of the output signal 0 and output signal 1 further comprises: responsive to ITD(m − 1) = 0 and ITD(m) > 0, assigning the first output buffer (550) to output signal 1 and the second output buffer (560) to output signal 0; responsive to ITD(m − 1) = 0 and ITD(m) ≤ 0, assigning the first output buffer (550) to output signal 0 and the second output buffer (560) to output signal 1; responsive to ITD(m − 1) > 0, assigning the first output buffer (550) to output signal 1 and the second output buffer (560) to output signal 0; and responsive to ITD(m − 1) < 0, assigning the first output buffer (550) to output signal 0 and the second output buffer (560) to output signal 1 (see the sketch following Embodiment 16 below).

14. An apparatus (114, 300, 1502) having an inter-channel time difference, ITD, synthesizer (220, 340, 1502) adapted to: receive (401) a current ITD and audio frames, wherein each frame m comprises N samples; store (403) at least a part of the current input audio frame in signal memory; determine (405) transition times t1, t2 to perform a time shift to apply to at least one of an output signal 0 and an output signal 1 based on an inter-channel time difference, ITD, of the current input audio frame and an ITD of a previous input audio frame; and apply (407) the time shift within the determined transition times t1, t2 in generation of the output signal 0 and the output signal 1.

15. The apparatus (114, 300, 1502) of Embodiment 14, wherein the ITD synthesizer (220, 340, 1502) is further adapted to perform according to any of Embodiments 2-13.

16. An apparatus (114, 300, 1502) having an inter-channel time difference, ITD, synthesizer (220, 340, 1502) comprising: processing circuitry (1202); and memory (1210) coupled with the processing circuitry, wherein the memory includes instructions that when executed by the processing circuitry cause the ITD synthesizer (220, 340, 1502) to perform operations comprising: receiving (401) a current ITD and audio frames, wherein each frame m comprises N samples; storing (403) at least a part of the current input audio frame in signal memory; determining (405) transition times t1, t2 to perform a time shift to apply to at least one of an output signal 0 and an output signal 1 based on an inter-channel time difference, ITD, of the current input audio frame and an ITD of a previous input audio frame; and applying (407) the time shift within the determined transition times t1, t2 in generation of the output signal 0 and the output signal 1.
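As a final illustration, the buffer-to-channel mapping of Embodiment 13 is simple enough to transcribe directly into a non-normative sketch; only the function and buffer names below are invented, the branch logic follows the embodiment text.

```python
def assign_outputs(buf_first, buf_second, itd_curr: int, itd_prev: int):
    """Map the first (550) and second (560) output buffers onto output
    signal 0 and output signal 1 per Embodiment 13.
    Returns (output signal 0, output signal 1)."""
    if itd_prev == 0:
        if itd_curr > 0:
            return buf_second, buf_first  # first buffer -> output signal 1
        return buf_first, buf_second      # ITD(m) <= 0: first -> signal 0
    if itd_prev > 0:
        return buf_second, buf_first      # first buffer -> output signal 1
    return buf_first, buf_second          # ITD(m - 1) < 0: first -> signal 0
```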
17. The apparatus (114, 300, 1502) of Embodiment 16, wherein the memory includes further instructions that when executed by the processing circuitry cause the ITD synthesizer (220, 340, 1502) to perform according to any of Embodiments 2-13.

18. A computer program comprising program code to be executed by processing circuitry (1202) of an apparatus (114, 300, 1502) having an inter-channel time difference, ITD, synthesizer (220, 340, 1502), whereby execution of the program code causes the ITD synthesizer (220, 340, 1502) to perform operations comprising: receiving (401) a current ITD and audio frames, wherein each frame m comprises N samples; storing (403) at least a part of the current input audio frame in signal memory; determining (405) transition times t1, t2 to perform a time shift to apply to at least one of an output signal 0 and an output signal 1 based on an inter-channel time difference, ITD, of the current input audio frame and an ITD of a previous input audio frame; and applying (407) the time shift within the determined transition times t1, t2 in generation of the output signal 0 and the output signal 1.

19. The computer program of Embodiment 18 comprising further program code, whereby execution of the program code causes the ITD synthesizer (220, 340, 1502) to perform according to any of Embodiments 2-13.

20. A computer program product comprising a non-transitory storage medium including program code to be executed by processing circuitry (1202) of an apparatus (114, 300, 1502) having an inter-channel time difference, ITD, synthesizer (220, 340, 1502), whereby execution of the program code causes the ITD synthesizer (220, 340, 1502) to perform operations comprising: receiving (401) a current ITD and audio frames, wherein each frame m comprises N samples; storing (403) at least a part of the current input audio frame in signal memory; determining (405) transition times t1, t2 to perform a time shift to apply to at least one of an output signal 0 and an output signal 1 based on an inter-channel time difference, ITD, of the current input audio frame and an ITD of a previous input audio frame; and applying (407) the time shift within the determined transition times t1, t2 in generation of the output signal 0 and the output signal 1.

21. The computer program product of Embodiment 20, wherein the non-transitory storage medium includes further program code, whereby execution of the program code causes the ITD synthesizer (220, 340, 1502) to perform according to any of Embodiments 2-13.