A SYSTEM FOR TRANSMITTING LOW LATENCY, SYNCHRONISED AUDIO

Document Type and Number:

WIPO Patent Application WO/2016/030694

Abstract:

A system for transmitting low latency, synchronised audio that includes an audio source, a processor, a controller and a sink zone with a DAC. Particularly, the processor is capable of selectively resampling the audio source in order to output a data packet for transmission to the sink zone that has a maximised payload size while packet frequency remains a whole number.

Inventors:

CUBITT DOMINIC (GB)

Application Number:

PCT/GB2015/052501

Publication Date:

March 03, 2016

Filing Date:

August 28, 2015

Assignee:

LODE AUDIO LTD (GB)

International Classes:

H04L12/24; G10L19/16; H04J3/06; H04L29/06

Other References:

CARÔT ALEXANDER ET AL: "Network Music Performance (NMP) in Narrow Band Networks", AES CONVENTION 120; MAY 2006, AES, 60 EAST 42ND STREET, ROOM 2520 NEW YORK 10165-2520, USA, 1 May 2006 (2006-05-01), XP040507618
KURTISI Z ET AL: "Using wavpack for real-time audio coding in interactive applications", MULTIMEDIA AND EXPO, 2008 IEEE INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 23 June 2008 (2008-06-23), pages 1381 - 1384, XP031312988, ISBN: 978-1-4244-2570-9
SHENOI KISHAN ET AL: "Synchronizing multi-media streams", 18 April 2013 (2013-04-18), XP055230116, Retrieved from the Internet [retrieved on 20151120]

Attorney, Agent or Firm:

HIGGS, Jonathan (Altius House, 1 North Fourth Street, Milton Keynes, MK9 1NE, GB)

Claims:

1. A system for transmitting low latency, synchronised audio, including:

an audio source;

a processor;

a controller;

a sink zone with a DAC;

wherein the processor is capable of selectively resampling the audio source in order to output a data packet for transmission to the sink zone that has a maximised payload size while packet frequency remains a whole number.

2. The system of claim 1 further including a media player.

3. The system of claim 1 or 2 wherein the audio source is resampled to 48 kHz, unless the audio source is already at 48 kHz.

4. The system of claim 3 wherein the data packet has 128 frames with a packet frequency of 375Hz and payload size of 594.

5. The system of claim 2 wherein the media player is a single or multi zone media player.

6. The system of any preceding claim wherein there are multiple sink zones.

7. The system of claim 6 wherein a common clock time exists across the multiple DACs of the sink zones, achieved via implementation of a protocol such as Network Time Protocol (NTP) and Precision Time Protocol (PTP).

8. The system of any preceding claim wherein the data packet includes a packet header and data.

9. The system of claim 8 wherein the packet header includes information pertaining to sequence, presentation time and data length.

10. The system of claim 9 wherein the presentation time is determined according to a fixed delay or measuring the minimum, maximum and average durations between packet arrival and its intended delivery time and sending this information back to the processor so that a presentation delay can be optimized for the system in question.

11. The system of any preceding claim wherein adjustments are made to stretch and sync the data packets to maintain synchronization.

12. The system of any preceding claim wherein DAC clock rate is addressed by resampling the presentation frame in accordance with the DAC's buffer size to be slightly longer or shorter depending on how full the audio buffer is thereby addressing audio clock drift across different hardware audio clocks.

13. The system of any preceding claim, utilising standards compliant IP adapters.

14. The system of any preceding claim wherein the processor includes or is in communication with a data storage means capable of storing the audio source.

15. The system of claim 2 wherein the media player is a CD player.

16. The system of any preceding claim including an analogue sound source where the processor encodes a digital stream for transmission as a data packet.

17. A method for transmitting low latency, synchronised audio including the step of taking an audio source input, decoding it if necessary into RAW PCM and upsampling it to 48KHz to produce an audio output for packetization by a packetizer.

18. The method of claim 17 wherein the audio output is fed into a buffer at a rate faster than real time.

19. The method of claim 18 wherein audio output is transmitted from the buffer over Internet Protocol.

20. The method of claim 19 wherein transmission jitter is reduced by providing a high-resolution chronographic library, utilizing hardware timers, so that the packetizer can take a chunk of audio from the buffer, convert it into a packet and transmit it.

21. The method of any preceding claim 17 to 20 wherein the timing resolution tolerance is below 0.001ms and at a rate of 375kHz.

22. The method of any preceding claim 17 to 21 wherein IP transmission is based on UDP Multicast optionally augmented with both point-to-point UDP and TCP.

23. The method of any preceding claim 17 to 22 wherein a depacketizer is responsible for receiving the transmitted data and converting it into packets placed into a buffer ahead of a presentation time encoded into said packet.

24. The method of claim 23 wherein a high resolution timer takes the packets out of the buffer and presents them into the DAC hardware buffer at the allotted presentation time.

25. The method of any preceding claim 17 to 24 wherein resampling is optionally performed where the audio clock rate does not exactly match real time, said resampling intended to stretch, or shrink the audio packet slightly so that the audio is consistently delivered.

Description:

A System for Transmitting Low Latency, Synchronised Audio

Technical Field

The present invention relates to a system for transmitting low latency, synchronised audio, particularly using the Internet Protocol (IP). Implementation of the invention is suited to the custom install industry wherein an audio system is installed into a house with loud speakers located in multiple rooms, i.e. "multi-zone", intended to be accessed and/or controlled remotely or at a central location.

Background to the Invention

Multi-zone audio systems have been possible for some time, e.g. utilising traditional analogue wired transmission to enable communication between components. Implementing such a system can be complex and time consuming depending on the ease with which copper wires can be installed between rooms and around a house.

It is clearly advantageous to apply modern "wireless" communication techniques in order to transmit audio from a source (place where the audio comes from) to a sink (place where the audio goes to), without the need for copper wires everywhere. However, producing an effective successor to copper wires requires a number of rather difficult problems to be solved, as copper wires have several desirable qualities. In particular these are:

• Low Latency: the transmission of sound over copper happens very quickly, e.g. there is no perceptible delay between a television's speech/audio in a distant loud speaker and its displayed image.

• Synchronization: as a by-product of extremely low latency there is the advantage of synchronization over multiple sinks, i.e. you can play music to many places (e.g. rooms) simultaneously and 'in time'.

These characteristics may seem trivial in the analogue world but in the digital world they are challenging problems to solve. Nevertheless, there are many digital qualities that are highly desirable, once these problems are solved, that the analogue equivalent would find difficult. For example:

• Very long distance transmission becomes easy.

• Literally hundreds of wires can be replaced with zero or one wire, e.g. wireless WiFi or IP over one copper wire.

Available systems that operate based on digital technology would appear to provide many possible solutions; however, each has limitations and is not directly suitable for the custom install (CI) market that the present invention is best adapted to and intended for. Prior art systems include:

• CobraNet: an extremely expensive solution that lacks the flexibility to be licensed in a way suitable for the CI market. Requires proprietary IP hardware. Used mainly in studio and live concert environments where cost is less of an issue. Will not work over WiFi.

• Dante: Another expensive product, used mainly in the broadcast space. Requires proprietary IP hardware. No WiFi.

• AVB: More affordable but still requires proprietary IP hardware and proprietary network switching. No WiFi.

• Sonos: Proprietary solution. Causes difficulty in a managed network environment, due to protocol complexity and legacy implementation. High latency.

• PulseAudio (RTP): Open source, software only solution. High latency. Poor synchronization.

Summary of the Invention

The present invention seeks to provide a system that is closed source, network/transmission agnostic, low latency and synchronized, and that works over WiFi and over existing IP networks with a standard switch. The invention is intended to provide much more than audio over IP: it is a full audio fabric, capable of switching and joining.

IP, as the primary protocol in the Internet layer of the Internet protocol suite, has the task of delivering packets from a source host to a destination host solely based on the IP addresses in the packet headers. For this purpose, IP defines packet structures that encapsulate the data to be delivered. It also defines addressing methods that are used to label a datagram with source and destination information. Preferably the present invention will utilise the known Internet Protocol for transmission and receipt of data streams.

In one broad aspect of the invention there is provided a system for transmitting low latency, synchronised audio, according to the appended claims.

Preferably the media player includes or is in communication with a data storage device.

While the invention is proprietary hardware, the underlying IP adapters are standards compliant and indeed commodity components. This provides the advantage of digital transmission neutrality. It is essential for the chosen market that standard networking equipment is used, as the industry will not accept the massive increase in cost of proprietary packet switching let alone the re-training required.

Brief Description of the Drawings

Figure 1 illustrates an example of system architecture according to the invention;

Figure 2 illustrates an example of the minimum components in a system according to the invention;

Figure 3 illustrates a pictorial view of a data packet processed according to the invention; and

Figure 4 illustrates the processing sequence of raw audio according to the invention.

Detailed Description of the Preferred Embodiment

The invention has many and varied applications to provide nodes on an audio fabric, e.g. an audio bridge. Such a system is shown in Figure 1.

As can be seen in Figure 1, a node on the system fabric works in concert with many other components to provide a coherent system. In general, a single node on its own has no meaning for a user. The collection of components (sources, sinks, matrixes and players) becomes more than the sum of its parts.

According to Figure 1, a "Lode Net" (proprietary name relating to the invention) media player is a single or multi zone media player that will transmit a digital music collection or digital streaming service into an audio fabric. What this means is that a user can dial in a favourite artist, and play that anywhere there is a sink. For example, a user can instruct the media player, via an iPad application, to transmit a favourite song into the room that they are currently in. This would be achieved by another component that automatically handles the routing of the sound based on where you are. The key element here is the routing.

A system according to the invention replaces traditional line level RCA and balanced XLR wired connections, as well as going a step further providing matrix 'n:n' switching between sources and sinks.

Referring to Figure 2, a minimum system requirement may be considered a player (source), a zone (sink) and a controller. These components may be packaged as a single unit, as 3 separate units, or any combination thereof. As previously mentioned, two specific areas of difficulty arise when replacing traditional analogue connections with a digital network: latency and synchronization. We must consider each of these problems in turn.

First there is the problem of latency which causes a number of issues. Essentially we are talking about driving a remote audio buffer over a transmission with high jitter and different characteristics.

Here we need to look at the characteristics of each system and find the common ones in order to find a coherent solution.

There is a clear disparity between the two systems we are trying to integrate and to further complicate matters we have, in many cases, a CD quality 44.1kHz 16bit input source.

According to the invention, what we need is a delivery payload that is delivered at precise timing intervals, whose byte size is a power of 2, which can be chunked from a CD quality source, and which is smaller than the 1500 byte MTU minus the necessary padding required for transmission.

To make this clearer, consider each of these motivations, constraints and requirements of an example of the invention in practice:

• Payload must be less than MTU: This is so that the underlying network will not split the transmitted packet into many chunks, causing packet fragmentation and negatively affecting performance. The default MTU is 1500 bytes; however, we must subtract from this as the packet (depending on type) will require header and padding information. We will allow 64 bytes for the transmission padding to cover both main IP protocols as well as allowing room for manoeuvre. Additionally it is desirable to have a proprietary packet header that is 18 bytes in length. So the maximum payload size becomes 1418 bytes (1500 - 64 - 18).

• Payload size must be a power of 2: This constraint arises because we eventually have to convert the sound back to analogue. All buffers for Digital to Analogue Converters (DAC) work in powers of 2. Essentially what we are trying to achieve here is a transport neutral way of remotely driving an audio buffer for a DAC.

By way of background, a frame is considered one sample of data. So for CD it would be 4 bytes: 2 channels and 2 bytes per channel. The protocol payload - the packet or message minus the header - the remaining data, is a collection of frames that must be a power of 2.

• Source frequency must be divisible by payload frequency, i.e. give a whole number: In one notable case the source is CD quality digital information, 44.1kHz at 16 bit. The system must be able to transmit this. The payload frequency, which is a function of the payload size and the source frequency, must be a whole number.

A chunk of data (the payload) must be transmitted at a particular frequency. For example, if we have a payload size of 22050 frames, we must transmit this twice per second for CD quality. The problem comes with 44.1kHz when you look at factoring it into a smaller power of 2. For example, if you wanted a payload size of 128 frames, it would have to be transmitted at an irregular rate of 344.53125 times per second. This is highly undesirable and difficult for the hardware to achieve. Therefore, preferably, the payload frequency must be a whole number.

• Payload buffer delivery at precise timing: As we wish to have low latency, we must deliver the payload to the DAC buffer within 1 / (4 * packet frequency) to ensure the audio is uninterrupted. Note: This is an implementation constraint, not a protocol constraint, and is achieved in our case by patching the kernel with real-time extensions that provide us with such timing constraints.

The formula to derive the payload size (I) can be given by:

$$I = t + p + \left(\frac{s}{f} \cdot c \cdot r\right)$$

where:

t = transmission overhead (64)

p = protocol overhead (18)

s = sample frequency

f = packet frequency

c = channels (stereo = 2)

r = bytes per frame

Bytes per frame is a function of transmission resolution. At the moment it is proposed to use 32 bit integer, so it is 4, motivated by guilt free soft volume control. However, it is probable the invention will move to a floating point transmission representation as, further down the pipeline, a DSP filter chain has been implemented within the zone/sink, and floating point makes clipping easier to deal with as a result of using two pole shelving filters for parametric equalisation. Note that the worked figures in this description take r as 2, i.e. 16 bit samples.

Using our current criteria we have the following situation, where the first payload size is calculated as:

I = 64 + 18 + (44.1k/22.05k * 2 * 2) = 90
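The calculation above can be reproduced with a short sketch. The function name and keyword names are illustrative and not taken from the specification; r is taken as 2 bytes, which is the value the worked figure uses.

```python
def packet_size_bytes(sample_rate_hz, packet_rate_hz, channels=2,
                      bytes_per_sample=2, transport_overhead=64,
                      protocol_overhead=18):
    """Evaluate I = t + p + (s / f) * c * r with the symbols defined above."""
    # s / f is the number of frames carried per packet; it must be a whole number.
    frames, remainder = divmod(sample_rate_hz, packet_rate_hz)
    if remainder:
        raise ValueError("sample rate must be a whole multiple of the packet rate")
    return transport_overhead + protocol_overhead + frames * channels * bytes_per_sample


print(packet_size_bytes(44_100, 22_050))  # 90, the first 44.1kHz case
print(packet_size_bytes(48_000, 375))     # 594, the 48kHz case with 128 frames
```

The second call corresponds to the 128 frame, 375Hz configuration named in claim 4.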

Table 1: Comparative Payload Size for 44.1kHz CD Quality Audio

Referring to Table 1, while there are a couple of potential payload sizes that meet our protocol requirements, the system pressure to deliver packets at such precise timings is considerable: 0.01ms and 0.02ms for payloads of 2 and 4 frames respectively. While this is theoretically possible, it is unlikely to provide a robust solution for the market. A hardware timer has to be generated internally to deliver the payload to the DAC, and current hardware is not very good at doing this. For a payload size of 2 frames the timer would need to deliver the packet every 1/22050 of a second, about 0.00004s or 0.04ms. With the aforementioned payload buffer delivery window at a quarter of this (0.01ms), and given the internal jitter of timers, an order of magnitude higher resolution (0.001ms) would be required to reliably perform this task. This is not a precise calculation but gives an indication of the sort of accuracy required.
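As a rough check on these timing figures, the delivery window of 1 / (4 * packet frequency) stated earlier can be evaluated directly. This is a minimal sketch; the function name is illustrative only.

```python
def delivery_window_ms(sample_rate_hz, frames_per_packet):
    """Quarter of the packet period, in milliseconds: 1 / (4 * packet frequency)."""
    packet_rate_hz = sample_rate_hz / frames_per_packet
    return 1000.0 / (4 * packet_rate_hz)


for rate, frames in ((44_100, 2), (44_100, 4), (48_000, 128)):
    window = delivery_window_ms(rate, frames)
    print(f"{rate} Hz, {frames} frames: window of about {window:.3f} ms")
```

For 44.1kHz this gives windows of roughly 0.011ms and 0.023ms for 2 and 4 frames, while a 128 frame payload at 48kHz relaxes the window to about 0.667ms.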

Clearly the problematic constraint here is the source frequency needing to be divisible by the payload frequency. CD quality digital sound at 44.1kHz is sufficient for the human ear (by the Nyquist criterion it can represent frequencies up to 22.05kHz) and represents a minimal representation to that end; however, it remains mathematically difficult to manipulate. If we resample the sound to 48kHz, a frequency much easier to manipulate mathematically, we gain a much broader number of solutions that present much less pressure onto the system.
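The difference between the two sample rates can be seen by enumerating the power of 2 frame counts that satisfy the stated constraints. The sketch below is illustrative: the names are assumptions, the search starts at 2 frames (the smallest case worked in the text), and 2 bytes per sample are assumed as in the worked figures.

```python
MTU_BYTES = 1500
TRANSPORT_OVERHEAD = 64   # t in the formula above
PROTOCOL_OVERHEAD = 18    # p in the formula above


def candidate_payloads(sample_rate_hz, channels=2, bytes_per_sample=2):
    """List (frames, packet_rate_hz, packet_bytes) for each valid power of 2 frame count."""
    results = []
    frames = 2
    while True:
        packet_bytes = (TRANSPORT_OVERHEAD + PROTOCOL_OVERHEAD
                        + frames * channels * bytes_per_sample)
        if packet_bytes > MTU_BYTES:
            break  # larger payloads would be fragmented by the network
        if sample_rate_hz % frames == 0:
            # The packet frequency is a whole number, so this candidate is usable.
            results.append((frames, sample_rate_hz // frames, packet_bytes))
        frames *= 2
    return results


for rate in (44_100, 48_000):
    print(rate, candidate_payloads(rate))
```

Under these assumptions 44.1kHz admits only the 2 and 4 frame payloads, whereas 48kHz admits every power of 2 from 2 to 128 frames, the largest being 128 frames at 375Hz and 594 bytes.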

Table 2: Comparative Payload Size for Resampled Audio at 48kHz

Here we have many more solutions that all present a more relaxed system pressure and make for a more robust overall solution. Typically we would use a payload of 128 frames with a packet frequency of 375Hz and size of 594 bytes.

Now we have a reasonable solution for latency, let us look at the synchronization issue. The problem here is twofold. Firstly the audio payload must be delivered to the different DACs at precisely the same time. Secondly, as different DACs have different clocks that are slightly out of sync, adjustments must be made to stretch and sync the packets to maintain synchronization.

The problem of simultaneous payload delivery requires that a common clock time exists across the different DACs. This problem has long been solved with protocols such as Network Time Protocol (NTP) and more accurately Precision Time Protocol (PTP). For this reason, this problem domain is not addressed here.

Now that we have a common concept of time across the different pieces of hardware, we can tackle the problem of packet synchronization. Here, we use a presentation timestamp within the packet header, given at some point in the future, allowing for packet transmission and jitter as well as a little buffering.

The next question is how far in the future the presentation timestamp should be set. One solution is to use a fixed delay. The advantage is that it is very predictable, while the disadvantage is that, in order to be robust, the lowest common denominator of delay must be used, negatively affecting the overall latency. Another approach is to measure the minimum, maximum and average durations between the packet arrival and its intended delivery time and send this information back to the audio sender so that the target presentation delay can be optimized for the network in question.

Next we have to cater for different DACs having slightly different clock rates. This is partially dealt with by virtue of a theoretical presentation timestamp relative to the input sampling frequency and can be completed by resampling the presentation frame in accordance with the DAC's buffer size. If we know how many frames are in the buffer and we know the buffer size at presentation time, we can resample the current frame to be slightly longer or shorter depending on how full the audio buffer is. This takes care of the audio clock drift across different hardware audio clocks.
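The measurement approach to choosing the presentation delay could be sketched as follows. Only the minimum, maximum and average statistics come from the text; the class and function names, and the adjustment policy with its safety margin, are hypothetical and shown purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class MarginReport:
    minimum_ms: float
    maximum_ms: float
    average_ms: float


def summarise_margins(arrival_times_ms, presentation_times_ms):
    """Summarise how far ahead of its presentation time each packet arrived."""
    margins = [p - a for a, p in zip(arrival_times_ms, presentation_times_ms)]
    if not margins:
        raise ValueError("no packets measured")
    return MarginReport(min(margins), max(margins), sum(margins) / len(margins))


def adjust_delay(current_delay_ms, report, safety_margin_ms=2.0):
    """Hypothetical sender policy: leave only the safety margin on the worst packet."""
    return current_delay_ms - report.minimum_ms + safety_margin_ms


report = summarise_margins([0.0, 10.0, 20.0], [8.0, 15.0, 29.0])
print(report)
print(adjust_delay(10.0, report))
```

A negative minimum margin means at least one packet arrived late, in which case this policy lengthens the delay rather than shortening it.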

The underlying protocol according to the invention is specified according to Figure 3. Referring to Figure 3, the data types are as follows:

Implementation of the invention can be broadly broken down into a series of functional components. A simplified illustration is shown in Figure 4.

According to the invention, the decoder/upsampler is responsible for taking digital audio as its input, decoding it if necessary into RAW PCM, and upsampling it to 48kHz. It is then fed into a buffer (preferably at a faster than real time rate).

The next step is to take the audio out of the buffer and transmit it, which in the preferred form is over IP. One of the key aims is to reduce transmission jitter as much as possible, since reducing jitter leads to reduced latency. To address this a high-resolution chronographic library, utilizing hardware timers, has been written so that the packetizer can take a chunk of audio from the buffer, convert it into a packet and transmit it, within a timing resolution tolerance below 0.001ms and at a rate of 375Hz (in the exemplified case). While every effort is made to transmit the IP packet at precise intervals, other network components and traffic will lead to packet jitter, and as such the packet arrival time cannot be relied upon at all.

At this point, the packet has been transmitted into the Audio Fabric. This (at present) is based on IP and uses several transmission strategies. UDP (User Datagram Protocol) and TCP (Transmission Control Protocol) are both used, as well as UDP Multicast. This is configurable based upon the desired architecture.
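The packetizing step can be sketched as follows. The header fields (sequence, presentation time, data length) are those named in claim 9, but the field widths and byte order here are assumptions chosen only so that the header totals 18 bytes; the actual layout is defined by Figure 3, which is not reproduced in this text. The function names are likewise illustrative.

```python
import struct

# Hypothetical layout: 8 byte sequence, 8 byte presentation time (ns),
# 2 byte data length, network byte order, 18 bytes in total.
HEADER = struct.Struct("!QQH")


def build_packet(sequence, presentation_time_ns, pcm_payload):
    """Prefix a chunk of PCM audio with the assumed 18 byte header."""
    return HEADER.pack(sequence, presentation_time_ns, len(pcm_payload)) + pcm_payload


def parse_packet(packet):
    """Recover (sequence, presentation_time_ns, payload) from a packet."""
    sequence, presentation_time_ns, length = HEADER.unpack_from(packet)
    payload = packet[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated packet")
    return sequence, presentation_time_ns, payload


chunk = bytes(128 * 2 * 2)  # 128 frames, stereo, 16 bit: 512 bytes of silence
packet = build_packet(1, 1_000_000_000, chunk)
print(HEADER.size, len(packet))
```

With a 512 byte payload the packet is 530 bytes before the 64 bytes allowed for transmission overhead, which matches the 594 byte figure given earlier.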

For a residential custom installation UDP Multicast has been selected as the preferable option, but this will be augmented with both point-to-point UDP and TCP. Both UDP and TCP are deliverable over IP.

The depacketizer is responsible for receiving the transmitted data from the network and converting it into packets according to the invention. It places these packets into a buffer (ideally ahead of presentation time). Again, a high resolution timer takes these packets out of the buffer and presents them into the DAC hardware buffer at the allotted presentation time. However, slight additional resampling may sometimes be required as the audio clock rate does not exactly match real time. This resampling will stretch or shrink the audio packet slightly so that the audio is consistently delivered.
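The stretch or shrink step might be sketched as below. The text does not specify the resampling method or the decision rule, so the linear interpolation, the half full threshold and all names are assumptions made for illustration, and a single channel of samples is used for brevity.

```python
def target_length(nominal_len, buffer_fill, buffer_size, max_adjust=1):
    """Hypothetical rule: shrink when the DAC buffer runs full, stretch when it runs empty."""
    half = buffer_size / 2
    if buffer_fill > half:
        return nominal_len - max_adjust  # DAC is consuming slowly: deliver fewer frames
    if buffer_fill < half:
        return nominal_len + max_adjust  # DAC is consuming quickly: deliver more frames
    return nominal_len


def stretch_frames(samples, target_len):
    """Resample one channel to target_len samples by linear interpolation."""
    if target_len < 2 or len(samples) < 2:
        raise ValueError("need at least two samples in and out")
    step = (len(samples) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * step
        left = min(int(pos), len(samples) - 2)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[left + 1] * frac)
    return out


packet = [float(n) for n in range(128)]
resampled = stretch_frames(packet, target_length(len(packet), buffer_fill=3, buffer_size=4))
print(len(resampled))
```

Changing the length by one frame in 128 alters the effective rate by under one percent, which is the kind of slight adjustment the text describes; a production resampler would likely use a higher quality filter than linear interpolation.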

The invention is not a software only solution. The software aspect will only work for the specific hardware components it is tuned to. As such, the invention is provided as hardware with the capabilities described. The system is essentially considered firmware rather than software, and is delivered to an embedded system.

Referring to the above example embodiment of the invention, if the music is already at 48kHz, there is no need to resample. The reference to "media player" in this case is not quite accurate since it does not really play the music; it just transmits it, regardless of whether the receiving zone is on the same machine or a remote machine.

It is well known that "HD" audio might be 24/32 bit and 48/96/192kHz. It is clearly envisaged by the invention that the processor may downsample the available audio or otherwise convert it to 16 bit 48kHz in order to be transmitted by IP while meeting the conventional MTU requirements.

The system of the invention is, overall, a configurable framework where the provided example is merely one set of currently specified (default) values. The installer may be able to change these settings according to requirements.

For a residential system installation, a set of options is chosen so the installer will not need to worry about this. However, for a professional range of products, operating parameters will be configurable and as such the system will recalculate.

It is eventually intended to provide many products to the industry to solve many different problems the industry faces and incorporate these with the invention or as standalone solutions; ranging from an audio bridge, to take sound from a TV and broadcast into Lode Net as illustrated, through to a matrix, that switches audio sinks on Lode Net to various broadcast sources on Lode Net (like the aforementioned TV bridge), through to a full multi-zone media player, controllable directly via an iPad/tablet/smartphone interface, or indirectly through a control system such as Crestron or AMX.
