

Title:
A METHOD FOR EMBEDDING OR DECODING AUDIO PAYLOAD IN AUDIO CONTENT
Document Type and Number:
WIPO Patent Application WO/2023/212753
Kind Code:
A1
Abstract:
A method for embedding audio payload in audio content, a method for decoding audio payload embedded in audio content, and a method for detecting transients. Using complementary spreading sequences in direct spread spectrum watermarking or alternatively using multi-level spreading sequences applied to modulate the amplitude spectrum of the audio content to embed the payload by spreading the payload bits. Decoding watermark bits based on correlation sign or decoding by obtaining a number of windows used during encoding to combine the amplitude spectra into one window for calculating the correlation with at least one spreading sequence. Encoding and decoding a payload by using multiple windows combined with different polarities as a block to transport the same payload.

Inventors:
PAMINDER BRAR (GB)
Application Number:
PCT/AT2022/060151
Publication Date:
November 09, 2023
Filing Date:
May 02, 2022
Assignee:
MEDIATEST RES GMBH (AT)
International Classes:
G10L19/025; G10L19/018
Foreign References:
US20140129011A1 2014-05-08
US20100280641A1 2010-11-04
EP1310099A2 2003-05-14
EP2362387A1 2011-08-31
Other References:
HENRIQUE S MALVAR ET AL: "An improved spread spectrum technique for robust watermarking", 2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING. PROCEEDINGS. (ICASSP). ORLANDO, FL, MAY 13 - 17, 2002; [IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP)], NEW YORK, NY : IEEE, US, 13 May 2002 (2002-05-13), pages IV - 3301, XP032015544, ISBN: 978-0-7803-7402-7, DOI: 10.1109/ICASSP.2002.5745359
LITAO GANG ET AL: "MP3 resistant oblivious steganography", 2001 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING. PROCEEDINGS. (ICASSP). SALT LAKE CITY, UT, MAY 7 - 11, 2001; [IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP)], NEW YORK, NY : IEEE, US, vol. 3, 7 May 2001 (2001-05-07), pages 1365 - 1368, XP010803146, ISBN: 978-0-7803-7041-8, DOI: 10.1109/ICASSP.2001.941182
BANOCI VLADIMIR ET AL: "2D - Spread spectrum watermark framework for multimedia copyright protection", 2014 24TH INTERNATIONAL CONFERENCE RADIOELEKTRONIKA, IEEE, 15 April 2014 (2014-04-15), pages 1 - 4, XP032605705, ISBN: 978-1-4799-3714-1, [retrieved on 20140609], DOI: 10.1109/RADIOELEK.2014.6828466
Attorney, Agent or Firm:
TORGGLER & HOFMANN PATENTANWÄLTE GMBH & CO KG (AT)
Claims:
CLAIMS

1. Method for encoding at least one audio payload, in particular a watermark, in audio content, comprising at least the steps of:

- providing audio content into which at least one audio payload is to be embedded in the form of an amplitude spectrum in the frequency domain

- providing at least one audio payload in the form of a binary sequence wherein each bit of the binary sequence has a first value or a second value

- applying at least one spreading sequence to the binary sequence to obtain at least one spreading binary sequence, such that

• if a bit of the binary sequence is of the first value, the bit is spread by the at least one spreading sequence, and

• if a bit of the binary sequence is of the second value, the bit is spread by the negative of the at least one spreading sequence

- using the at least one spreading binary sequence to modulate the amplitude spectrum of the audio content in the frequency domain to embed the at least one audio payload into the audio content.

2. Method for encoding at least one audio payload, in particular a watermark, in audio content, in particular according to the preceding claim, comprising at least the steps of:

- providing audio content, into which at least one audio payload is to be embedded, in the form of an amplitude spectrum in the frequency domain

- providing at least one audio payload in the form of a binary sequence

- applying a first spreading sequence to the binary sequence to obtain a first level spreading binary sequence

- applying at least one more spreading sequence different from the first spreading sequence to the first level spreading binary sequence to obtain at least one more spreading binary sequence of a higher level

- using the at least one more spreading binary sequence of a higher level to modulate the amplitude spectrum of the audio content to embed the at least one audio payload into the audio content.

3. Method according to both preceding claims, wherein with respect to at least two different levels of spreading, preferably with respect to every level, the step of applying at least one spreading sequence of that level to the binary sequence of the previous level to obtain the at least one spreading binary sequence of a higher level, is done such that:

• if a bit of the binary sequence of the previous level is of the first value, the bit is spread by the at least one spreading sequence of that level, and

• if a bit of the binary sequence of the previous level is of the second value, the bit is spread by the negative of the at least one spreading sequence of that level.

4. Method according to one of the two preceding claims, wherein the step of applying at least one more spreading sequence different from the first spreading sequence to the first level spreading binary sequence to obtain at least one more spreading binary sequence of a higher level includes at least:

- providing a number of different spreading sequences

- applying a spreading sequence chosen from the number of spreading sequences to obtain a second level spreading binary sequence

- choosing a further spreading sequence from the number of spreading sequences and applying the further spreading sequence to obtain a third level spreading binary sequence

- repeating the previous step for a number of times, the number of times being equal to or larger than zero, until a highest-level spreading binary sequence is obtained

and wherein the step of using the at least one more spreading binary sequence to modulate an amplitude spectrum of the audio content to embed the at least one audio payload into the audio content includes at least using the highest-level spreading binary sequence to modulate the amplitude spectrum of the audio content in the frequency domain to embed the at least one audio payload into the audio content.

5. Method according to at least one of the preceding claims wherein the amplitude spectrum is fragmented into audio signal windows by applying a windowing transform into the frequency domain to the amplitude spectrum, wherein it is preferably provided that audio signal windows containing transients are encoded with less payload strength or are skipped.

6. Method according to the preceding claim wherein several consecutive windows are combined into blocks, such that windows in one block contain the same binary sequence wherein, preferably, each window in a block is assigned a polarity and spectra of windows of one block are added in accordance with their polarity value, wherein it is preferred that the polarity of the windows is chosen according to the selected Barker sequence.

7. Method for decoding at least one audio payload, in particular a watermark, from audio content, comprising at least the steps of:

- providing audio content into which at least one audio payload was embedded in the form of an amplitude spectrum in the frequency domain

- obtaining the at least one spreading binary sequence that was used to modulate the amplitude spectrum of the audio content in the frequency domain for embedding the at least one audio payload into the audio content

- calculating at least one correlation coefficient between

• at least part of the amplitude spectrum of the audio content, and

• the at least one spreading binary sequence, or its negative,

and in dependence on the sign of the at least one correlation coefficient determining whether an embedded bit of the at least one audio payload is of the first value or the second value thereby obtaining the value of the embedded bit of the embedded audio payload.

8. Method according to the preceding claim wherein:

- the step of obtaining the at least one spreading binary sequence that was used to modulate the amplitude spectrum of the audio content in the frequency domain for embedding the at least one audio payload into the audio content comprises obtaining all spreading binary sequences of different levels which were used to obtain the highest-level spreading binary sequence

- the step of calculating at least one correlation coefficient comprises calculating correlation coefficients between

• at least part of the amplitude spectrum of the audio content, and

• the highest-level spreading binary sequence, or its negative

thereby obtaining a sequence of highest-level correlation coefficients, and for each lower level calculating at least one correlation coefficient between the sequence of correlation coefficients of the higher level and the spreading binary sequence of the lower level until the lowest level has been reached.

9. Method for decoding at least one audio payload, in particular a watermark, from audio content, comprising at least the steps of:

- obtaining the at least one more spreading binary sequence that was used to modulate the amplitude spectrum of the audio content for embedding the at least one audio payload into the audio content

- obtaining information about the number of windows used during encoding of the at least one audio payload

- using a decoder window to read a block of the audio content into which the at least one audio payload has been embedded

- dividing the block in the decoder window into a number of windows which is greater than the number of windows used during encoding of the at least one audio payload

- calculating for each window its amplitude spectrum

- combining the obtained amplitude spectra into one window

- calculating at least one correlation between the at least one more spreading binary sequence and the combined amplitude spectra thereby obtaining the embedded audio payload.

10. Method according to the preceding claim wherein:

- there are provided at least three different spreading sequences that have been used to modulate the amplitude spectrum of the audio content to embed the at least one audio payload into the audio content

- for each of the provided spreading sequences the correlation with the combined amplitude spectra is calculated

- the calculated correlations are combined into one new sequence of length $N \cdot L_1 \cdot L_2 \cdot \ldots \cdot L_{m-1}$

- iterating the process $m$ times to obtain a sequence of length $N$.

11. Method according to one of the two preceding claims wherein before the step of calculating at least one correlation, windows polarities are determined, preferably by all possible permutations in the spreading sequences that were used for encoding, wherein it is preferably provided that the amplitude spectrum of the windows is cropped to an expanding sequence of length $N \cdot L_1 \cdot L_2 \cdot \ldots \cdot L_m$.

12. Method for detecting transients comprising checking each window of an audio payload, in particular an incoming audio stream, before applying a method according to at least one of the preceding claims wherein each window is split into fragments and a transient is discovered by discovering a change in a pre-determined property of the audio payload between adjacent fragments.

13. Computer program which when the program is executed by a computer causes the computer to carry out a method according to at least one of the preceding claims.

14. A data carrier signal carrying the computer program of the preceding claim or audio content into which at least one audio payload has been embedded according to a method of at least one of claims 1 to 6.

15. An encoder, a decoder or a transient detector being configured, respectively, to encode at least one audio payload in audio content according to a method of at least one of claims 1 to 6 or to decode audio payload from audio content according to a method of at least one of claims 7 to 11 or to detect transients according to the method of claim 12.

Description:
A METHOD FOR EMBEDDING OR DECODING AUDIO PAYLOAD IN AUDIO CONTENT

TECHNICAL FIELD

[1] In one aspect the invention is in the field of encoding at least one audio payload in audio content. In another aspect the invention relates to decoding at least one audio payload from audio content. In yet another aspect the invention relates to detection of transients in audio content.

BACKGROUND

[2] The need for embedding audio payload (e.g., a watermark) in audio content such that the audio payload is not perceived by a listener, but the audio payload is robust against intentional manipulations or disturbances, is well established in the art (see by way of example EP 2 362 387 A1).

[3] It is also known to use spread spectrum techniques for watermarking as described, e.g., in Malvar et al., "An improved spread spectrum technique for robust watermarking", (DOI: 10.1109/ICASSP.2002.1004617) and Ullah et al., "Improving the security level in direct sequence spread spectrum using dual codes", International Journal of Security and Its Applications, Vol. 6 No. 2, April 2012.

[4] Known techniques do not create audio payload embedded in audio content which is robust when broadcasting through the airspace. With respect to decoding, presynchronization between sender and receiver is necessary.

[5] Although redundant coding techniques are known in the art, primarily for digital data transmission methods, there is still a need for improvement, as known techniques are adapted to correct burst bit errors and are poorly suited for encoding audio payload in audio content.

SUMMARY OF INVENTION

[6] It is an object of the invention to provide a method of encoding at least one audio payload in audio content in a way which is robust when broadcasting through the airspace.

[7] It is another object of the invention to provide a method of decoding at least one audio payload from audio content, in particular without the need to perform pre-synchronization (this means that detection can start at any time mark of the audio content).

[8] It is another object of the invention to provide a method for detecting transients in audio content.

[9] Yet another object of the disclosure relates to a computer program which, when the program is executed by a computer, causes the computer to carry out any of the methods described in this disclosure.

[10] Still another object of the disclosure relates to a data carrier signal carrying such a computer program or carrying audio content into which at least one audio payload has been embedded according to any of the methods described in this disclosure.

[11] Yet another object of the disclosure relates to an encoder, a decoder and a transient detector being configured, respectively, to encode or decode at least one audio payload in audio content according to a method as described in this disclosure or to detect transients in audio content.

[12] Still other objects and advantages of the invention will in part be obvious and will in part be apparent from the specification and drawings.

[13] In this disclosure the term "binary sequence" is understood to refer to a sequence of bits wherein each bit has a first value or a second value. Usually, the first value is denoted as "1" or "0" and the second value is denoted as "0" or "1"; however, other representations are possible.

[14] Audio content is sound in a digital format.

First variant of a method for encoding audio payload in audio content:

[15] One object of the disclosure relates to a method for encoding at least one audio payload in audio content, comprising at least the steps of:

- providing audio content into which at least one audio payload is to be embedded in the form of an amplitude spectrum in the frequency domain

- providing at least one audio payload in the form of a binary sequence wherein each bit of the binary sequence has a first value or a second value

- applying at least one spreading sequence to the binary sequence (which represents the at least one audio payload) to obtain at least one spreading binary sequence, such that

• if a bit of the binary sequence is of the first value, the bit is spread by the at least one spreading sequence, and

• if a bit of the binary sequence is of the second value, the bit is spread by the negative of the at least one spreading sequence

- using the at least one spreading binary sequence to modulate the amplitude spectrum of the audio content in the frequency domain to embed the at least one audio payload into the audio content

[16] Of course, the audio content can be transformed back into the time domain using the inverse transform that was used to obtain the amplitude spectrum in the frequency domain.

[17] During decoding of an audio payload which has been encoded by this method, it is possible to determine the bits of the audio payload simply by calculating at least one correlation coefficient between:

• at least part of the amplitude spectrum of the audio content, and

• the at least one spreading binary sequence, or its negative

It is determined in dependence on the sign of the correlation coefficient whether an embedded bit of the at least one audio payload is of the first value or the second value thereby obtaining the value of the embedded bit of the embedded audio payload.

[18] Simple numerical example 1:

[19] Audio payload in the form of a one-bit binary sequence is to be embedded into audio content with the random spreading sequence [0 1 0 0] or, if zeros are replaced with -1, with the spreading sequence SS_ONE = [-1 +1 -1 -1]. If the payload were bit 0, the same spreading sequence would be used but with reversed signs SS_ZERO = [+1 -1 +1 +1]. Audio payload of more than one bit can be treated analogously for each bit.

[20] The audio content (in this example called "SIGNAL") is in the form of an amplitude spectrum in the frequency domain. By way of example, in the following, audio content with only four frequency bins is used: SIGNAL = [2 2 4 1]. The different positions in the vector signify the different frequency bins, while the numbers signify the amplitude of the signal in each frequency bin.

[21] Using the spreading binary sequence SS_ONE to modulate the amplitude spectrum of the audio content SIGNAL in the frequency domain to embed the at least one audio payload (i.e., in this example the single bit "1" spread into SS_ONE) into the audio content, one obtains audio content in the frequency domain with embedded audio payload SIGNAL_WM:

SIGNAL_WM = SIGNAL + SS_ONE, i.e., SIGNAL_WM = [1 3 3 0]
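
The arithmetic of numerical example 1 can be reproduced in a few lines of Python. This is only an illustrative sketch: the helper names `spread_bit` and `embed_additive` are hypothetical and do not appear in the application, and plain element-wise addition is assumed as the modulation, as in the example.

```python
def spread_bit(bit, spreading_sequence):
    """Return the spreading sequence for bit 1 and its negative for bit 0."""
    sign = 1 if bit == 1 else -1
    return [sign * chip for chip in spreading_sequence]


def embed_additive(amplitude_spectrum, chips):
    """Add one chip to each frequency bin (simplest modulation, as in example 1)."""
    if len(amplitude_spectrum) != len(chips):
        raise ValueError("one chip per frequency bin is required")
    return [a + c for a, c in zip(amplitude_spectrum, chips)]


SS_ONE = [-1, +1, -1, -1]   # spreading sequence from example 1
SIGNAL = [2, 2, 4, 1]       # amplitudes of four frequency bins

SIGNAL_WM = embed_additive(SIGNAL, spread_bit(1, SS_ONE))
print(SIGNAL_WM)            # [1, 3, 3, 0]
```

Calling `spread_bit(0, SS_ONE)` yields SS_ZERO = [+1 -1 +1 +1], so a single stored sequence is enough for both bit values.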

[22] SIGNAL_WM can be transformed back into the time domain to obtain audio content with embedded audio payload that can be directly listened to.

Second variant of a method for encoding audio content:

[23] Another object of the disclosure relates to a method for encoding at least one audio payload in audio content, comprising at least the steps of:

- obtaining audio content into which at least one audio payload is to be embedded in the form of an amplitude spectrum

- providing at least one audio payload in the form of a binary sequence

- applying a first spreading sequence to the binary sequence (which represents the at least one audio payload) to obtain a first level spreading binary sequence

- applying at least one more spreading sequence (which is, of course, different from the first spreading sequence and can have equal length or different length) to the first level spreading binary sequence to obtain at least one more spreading binary sequence of a higher level

- using the at least one more spreading binary sequence of a higher level to modulate the amplitude spectrum of the audio content to embed the at least one audio payload into the audio content

[24] It was found that multiple expansion of audio payload with shorter spreading sequences is more efficient than expansion by use of one long spreading sequence. That is, if the length of the audio payload is $N$ and the length of the first spreading sequence (at the lowest level) is $L_1$, then the length of the first level spreading binary sequence will be equal to $N \cdot L_1$. The resulting sequence (with length $N \cdot L_1$) can be extended at a second, higher, level in a similar way with a new spreading sequence of length $L_2$. Thus, a new sequence of length $N \cdot L_1 \cdot L_2$ will be obtained. Continuing this process $m$ times, finishing with a highest-level spreading sequence of length $L_m$, a sequence of length $N \cdot L_1 \cdot L_2 \cdot \ldots \cdot L_m$ will be obtained.

[25] Simple numerical example 2:

[26] Audio payload WM consists of three bits such that WM = [1 0 1]. For value "1" the first level spreading sequence [1 -1] is used, for value "0" the first level spreading sequence [-1 1] is used such that a first level spreading binary sequence equal to

[1 -1 -1 1 1 -1] is obtained.

[27] In a next step, for value "1" the second level spreading sequence [1 -1 1] is used, for value "-1" the second level spreading sequence [-1 1 -1] is used such that a spreading sequence of higher (and, in this example, highest) level equal to [1 -1 1 -1 1 -1 -1 1 -1 1 -1 1 1 -1 1 -1 1 -1] is obtained.
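
The two spreading steps of numerical example 2 can be sketched as follows. The function names `spread` and `multilevel_spread` are hypothetical, and the sketch assumes that bits are first mapped to +1 and -1 so that spreading by "the sequence or its negative" becomes a simple multiplication.

```python
def spread(symbols, spreading_sequence):
    """Replace each +1 symbol by the sequence and each -1 symbol by its negative."""
    return [s * chip for s in symbols for chip in spreading_sequence]


def multilevel_spread(bits, spreading_sequences):
    """Apply the spreading sequences level by level, lowest level first."""
    symbols = [1 if b == 1 else -1 for b in bits]   # map 0 -> -1, 1 -> +1
    for sequence in spreading_sequences:
        symbols = spread(symbols, sequence)
    return symbols


WM = [1, 0, 1]
level_1 = multilevel_spread(WM, [[1, -1]])
level_2 = multilevel_spread(WM, [[1, -1], [1, -1, 1]])
print(level_1)        # [1, -1, -1, 1, 1, -1]
print(len(level_2))   # 18 = 3 * 2 * 3
```

The length of the result is the payload length multiplied by the lengths of all spreading sequences, which is why a few short sequences reach the same total length as one long sequence.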

[28] This highest-level spreading sequence is used to modulate the amplitude spectrum of the audio content in the frequency domain (in this example, with respect to eighteen frequency bins), e.g., by increasing (i.e., multiplying) the amplitude of a frequency bin if the corresponding value of the highest-level spreading sequence is "1" and decreasing (i.e., dividing) the amplitude of a frequency bin if the corresponding value of the highest-level spreading sequence is "-1". In the simplest case, addition can be used instead, as in example 1.

[29] In practice, two or three levels of spreading sequences are usually enough.

[30] Of course, the first variant and the second variant of a method for encoding at least one audio payload in audio content can be combined, as shown in the following numerical example.

[31] Simple numerical example 3:

[32] Two more spreading sequences of a higher level (level 2) are given, one for value "1" and one for value "0":

SS_ONE_2 = [+1 -1 +1]

SS_ZERO_2 = [-1 +1 -1]

[33] Bit 1, at level 1, is spread to [-1 +1 -1 -1], and, at level 2, is spread to

[-1 +1 -1 +1 -1 +1 -1 +1 -1 -1 +1 -1] which can be hidden in twelve bins of frequency (called in short B1 ... B12).

[34] Both variants for a method for encoding are based on a spread spectrum technique in frequency space. At least one spreading sequence is used for spreading the binary sequence representing the at least one audio payload. In the second variant, at least two spreading sequences are used instead of one long spreading sequence.

[35] Some of the advantages of both variants for a method for encoding are the possibility of dealing with large binary data sequences, the generation of audio content provided with embedded payload which can be decoded after having been sent over air, and resistance to noise interference and compression techniques. Both variants for a method for encoding can be used for live encoding.

[36] Both variants for a method for encoding use a redundancy coding technique that is adapted to the numerous kinds of interference that occur when audio signals representing audio content are transmitted through airspace or subjected to compression.

[37] Both variants for a method for encoding are scalable due to the ability to use different frequency ranges, changing the number of levels and lengths of the spreading sequences. Due to this, it is possible to achieve the necessary compromise between payload size and recognition quality.

[38] Both variants for a method for encoding allow for large payload size and short detection time.

[39] Both variants for a method for encoding can be used directly to protect original audio content from copying and to facilitate market research. Moreover, this technique allows metadata to be added to audio content and devices to be synchronized through the acoustic channel.

[40] It should be noted that the audio payload can be in any form such as a watermark, metadata, or the like. A watermark is understood to be a unique electronic identifier to be embedded in audio content.

[41] It should further be noted that, although both variants for a method for encoding described herein are particularly suited for audio content that is sent through air, they are not limited to this transmission channel. Payload recognition can also be performed by directly reading a file.

[42] Embedding of audio payload occurs in the frequency domain of the audio content. This provides secrecy, as well as reliability against interference and reverberation when transmitted through an acoustic data channel.

[43] Embedding audio payload (in the following sometimes referred to, by way of example, as "watermarking") means changing the spectrum of the audio content. The spectrum changes are preferably carried out using a short-time window method. The advantages of this technique are predictable computational complexity, as well as the presence of a DSP block in FPGA systems, which allows developing highly efficient hardware solutions.

[44] The source information of the audio payload for encoding must be represented in binary form.

[45] Preferably, a fast Fourier transform (FFT) and its inverse (IFFT), are used to transform signals between the time domain and the frequency domain.

First variant for a method for decoding audio payload:

[46] Another object of the disclosure relates to a method for decoding at least one audio payload from audio content, comprising at least the steps of:

- providing audio content into which at least one audio payload was embedded in the form of an amplitude spectrum in the frequency domain

- obtaining the at least one spreading binary sequence that was used to modulate the amplitude spectrum of the audio content in the frequency domain for embedding the at least one audio payload into the audio content

- calculating at least one correlation coefficient between

• at least part of the amplitude spectrum of the audio content, and

• the at least one spreading binary sequence, or its negative,

and in dependence on the sign of the at least one correlation coefficient determining whether an embedded bit of the at least one audio payload is of the first value or the second value thereby obtaining the value of the embedded bit of the embedded audio payload

[47] By way of example, the at least one spreading binary sequence could be obtained starting from a seed or key value by using a common random number generator or pseudorandom number generator.
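
A minimal sketch of deriving the same spreading sequence on the encoder and decoder side from a shared seed is given below. The function name, the seed value and the sequence length are hypothetical; Python's standard `random` module is used only for illustration and is not a cryptographically secure generator.

```python
import random


def spreading_sequence_from_seed(seed, length):
    """Derive a +/-1 spreading sequence from a shared seed (illustrative only)."""
    rng = random.Random(seed)   # private generator, global state is left untouched
    return [rng.choice((-1, 1)) for _ in range(length)]


encoder_sequence = spreading_sequence_from_seed(seed=2022, length=16)
decoder_sequence = spreading_sequence_from_seed(seed=2022, length=16)
print(encoder_sequence == decoder_sequence)   # True
```

Because both sides only need the seed, the spreading sequence itself does not have to be transmitted.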

[48] Simple numerical example 4:

[49] Getting back to numerical example 1, it is possible to calculate a correlation coefficient (e.g., Pearson correlation) with both spreading sequences:

cor(SS_ONE, SIGNAL_WM) = 0.56 > 0, therefore value "1" is embedded

cor(SS_ZERO, SIGNAL_WM) = -0.56 < 0, therefore value "0" is embedded

[50] The point is that it is not necessary to use two separate spreading sequences, one for "0" and one for "1". It is sufficient to check only for the first one (or only for the second one), and the sign of the correlation immediately shows whether a "1" or a "0" was embedded.

[51] In this numerical example, the p-value is 44.4 %, which is far too high to be significant. To accept the hypothesis that the correlation is not random but caused by the presence of a watermark, the p-value should be less than a given number, e.g., 5 %.
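
The figures of numerical example 4 can be checked with a short Python sketch. The helper names `pearson` and `decode_bit` are hypothetical. The p-value line relies on a closed form that holds only for four samples (two degrees of freedom), where the two-sided p-value of the usual t-test for a Pearson coefficient r reduces to 1 - |r|; for other lengths a statistics library would be needed.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def decode_bit(amplitude_spectrum, ss_one):
    """Decide the embedded bit from the sign of the correlation with SS_ONE."""
    return 1 if pearson(ss_one, amplitude_spectrum) > 0 else 0


SS_ONE = [-1, +1, -1, -1]
SS_ZERO = [+1, -1, +1, +1]
SIGNAL_WM = [1, 3, 3, 0]

r = pearson(SS_ONE, SIGNAL_WM)
p_value = 1 - abs(r)                      # closed form valid only for n = 4
print(round(r, 2), round(p_value, 3))     # 0.56 0.444
print(decode_bit(SIGNAL_WM, SS_ONE))      # 1
```

With only four frequency bins the decision is statistically weak, which is consistent with the remark that much longer spread sequences are needed before the p-value drops below a threshold such as 5 %.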

[52] If two levels of spreading sequences are used, it is possible to calculate correlation coefficients at each level, working down to the lowest level, where the sign of the final correlation coefficient determines the bit, e.g., as shown in the following simple numerical example 5, which uses the results of numerical example 3.

[53] Simple numerical example 5

[54] First step at level 2:

cor([B1, B2, B3], ss_one_2) = F1

cor([B4, B5, B6], ss_one_2) = F2

cor([B7, B8, B9], ss_one_2) = F3

cor([B10, B11, B12], ss_one_2) = F4

[55] Second step at level 1:

cor([F1, F2, F3, F4], ss_one_1) > 0, therefore final_bit = 1

cor([F1, F2, F3, F4], ss_one_1) < 0, therefore final_bit = 0
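
The two correlation steps of numerical example 5 can be sketched in Python as follows. The host spectrum (a gentle ramp over the twelve bins B1 ... B12), the additive embedding and the function names are assumptions made for illustration; only the spreading sequences are taken from numerical example 3.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


SS_ONE_1 = [-1, +1, -1, -1]   # level-1 sequence from example 3
SS_ONE_2 = [+1, -1, +1]       # level-2 sequence from example 3


def decode_two_level(bins):
    """Correlate each group of three bins at level 2, then the results at level 1."""
    step = len(SS_ONE_2)
    f = [pearson(bins[i:i + step], SS_ONE_2) for i in range(0, len(bins), step)]
    return 1 if pearson(f, SS_ONE_1) > 0 else 0


# Hypothetical host spectrum: a gentle ramp over twelve frequency bins.
host = [4.0 + 0.1 * k for k in range(12)]
chips_for_one = [a * b for a in SS_ONE_1 for b in SS_ONE_2]   # bit 1 spread twice

marked_one = [h + c for h, c in zip(host, chips_for_one)]
marked_zero = [h - c for h, c in zip(host, chips_for_one)]
print(decode_two_level(marked_one), decode_two_level(marked_zero))   # 1 0
```

Note that a Pearson coefficient is undefined for a group of bins with identical amplitudes; a practical decoder would have to guard against that case.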

Second variant for a method for decoding audio payload:

[56] Yet another object of the disclosure relates to a method for decoding at least one audio payload from audio content, comprising at least the steps of:

- obtaining the at least one more spreading binary sequence that was used to modulate the amplitude spectrum of the audio content for embedding the at least one audio payload into the audio content

- obtaining information about the number of windows used during encoding of the at least one audio payload (a window is a slice of data points in the time domain, the length of which corresponds to the applied window function)

- using a decoder window to read a block of the audio content into which the at least one audio payload has been embedded

- dividing the block in the decoder window into a number of windows which is greater than the number of windows used during encoding of the at least one audio payload

- calculating for each window its amplitude spectrum

- combining the obtained amplitude spectra into one window

- calculating at least one correlation between the at least one more spreading binary sequence and the combined amplitude spectra thereby obtaining the embedded audio payload
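
The window handling in the steps above can be sketched as follows. The sketch makes several assumptions that are not taken from the text: the amplitude spectra are combined by a plain bin-by-bin sum without window polarities, a naive discrete Fourier transform stands in for an FFT, the number of windows is passed in as a parameter, and all names are hypothetical.

```python
import cmath
import math


def amplitude_spectrum(window):
    """Magnitudes of a naive DFT for bins 0 .. len(window) // 2."""
    n = len(window)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(window)))
        for k in range(n // 2 + 1)
    ]


def combined_spectrum(block, n_windows):
    """Split a block into equal windows and sum their amplitude spectra bin by bin."""
    size = len(block) // n_windows
    spectra = [
        amplitude_spectrum(block[i * size:(i + 1) * size]) for i in range(n_windows)
    ]
    return [sum(bin_values) for bin_values in zip(*spectra)]


# Hypothetical test block: 4 windows of 8 samples, a cosine with 2 cycles per window.
block = [math.cos(2 * math.pi * 2 * t / 8) for t in range(32)]
combined = combined_spectrum(block, n_windows=4)
print([round(v, 6) for v in combined])   # [0.0, 0.0, 16.0, 0.0, 0.0]
```

The combined spectrum would then be correlated with the spreading binary sequence to read out the payload.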

[57] Both variants allow for blind detection of audio payload without the need for performing pre-synchronization.

Method for detection of transients:

[58] Another object of the disclosure relates to a method for detecting transients comprising checking each window of an audio content, in particular an incoming audio stream, before applying a method to encode an audio payload as described above, wherein each window is split into fragments and a transient is discovered by discovering a change in a predetermined property of the audio content between adjacent fragments.
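
A minimal sketch of such a check is given below. The text does not specify the predetermined property, so the sketch assumes mean energy per fragment and flags a window when the energy rises by more than a threshold ratio between adjacent fragments. The function names, the number of fragments and the threshold are hypothetical.

```python
def fragment_energies(window, n_fragments):
    """Mean squared amplitude of each equally sized fragment of a window."""
    size = len(window) // n_fragments
    return [
        sum(x * x for x in window[i * size:(i + 1) * size]) / size
        for i in range(n_fragments)
    ]


def has_transient(window, n_fragments=4, ratio_threshold=8.0, floor=1e-12):
    """Flag a window if energy rises by more than ratio_threshold between neighbours."""
    energies = fragment_energies(window, n_fragments)
    for previous, current in zip(energies, energies[1:]):
        if current / max(previous, floor) > ratio_threshold:
            return True
    return False


steady = [0.1, -0.1] * 8                         # constant level, no transient
attack = [0.01, -0.01] * 4 + [0.5, -0.5] * 4     # sudden onset in the second half
print(has_transient(steady), has_transient(attack))   # False True
```

This variant reacts only to sudden increases (onsets); a window flagged in this way could then be skipped or encoded with less payload strength.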

DESCRIPTION OF EMBODIMENTS

Embodiments of first and second method for encoding audio payload in audio content:

[59] In the frequency domain, to preserve sound quality, encoding audio payload with a strength depending on the frequency range can be used. It is possible to use an absolute hearing threshold curve as described in ISO 226 to determine allowable strength of the audio payload depending on the frequency range.

[60] The source information for the audio payload must be represented in the form of a binary sequence (i.e., a sequence of 0 and 1 or -1 and 1 or any other representation having two different symbols for encoding two different values, in the following the symbols 0 and 1 or -1 and 1 will be used without any intended restriction) for encoding. This binary sequence represents the audio payload. Direct sequence spread spectrum technique implies that each bit of the audio payload is assigned a spreading sequence to obtain a spreading binary sequence by replacing each bit of the audio payload with the corresponding spreading sequence.

[61] In some embodiments at least one spreading sequence, preferably several or all of the spreading sequences, is a pseudo-random sequence (pn-sequence). By way of example, Barker sequences can be used.
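
As an illustration of why Barker sequences are attractive as spreading sequences, the following sketch computes the aperiodic autocorrelation of the well-known Barker sequence of length 13. The choice of this particular length and the function name are assumptions made for the example.

```python
BARKER_13 = [+1, +1, +1, +1, +1, -1, -1, +1, +1, -1, +1, -1, +1]


def aperiodic_autocorrelation(sequence, lag):
    """Sum of products of the sequence with a copy of itself shifted by lag."""
    return sum(a * b for a, b in zip(sequence, sequence[lag:]))


peak = aperiodic_autocorrelation(BARKER_13, 0)
sidelobes = [aperiodic_autocorrelation(BARKER_13, lag) for lag in range(1, 13)]
print(peak, max(abs(s) for s in sidelobes))   # 13 1
```

The sharp peak at lag zero and the small values at all other lags make a correctly aligned sequence easy to distinguish from a shifted one.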

[62] After replacing each bit of the audio payload with the corresponding spreading sequence, the spreading binary sequence is obtained, in which the audio payload is encrypted. The spreading binary sequence is used to modulate the amplitude spectrum of the original audio content. Thus, the audio payload is hidden in the audio content.

[63] In a preferred embodiment of the encoding method the step of applying at least one more spreading sequence different from the first spreading sequence to the first spreading binary sequence to obtain at least one more spreading binary sequence of a higher level includes at least:

- providing a number of different spreading sequences

- applying a spreading sequence chosen from the number of spreading sequences to obtain a second level spreading binary sequence

- choosing a further spreading sequence from the number of spreading sequences and applying the further spreading sequence to obtain a third level spreading binary sequence

- repeating the previous step for a number of times, the number of times being equal to or larger than zero, until a highest-level spreading binary sequence is obtained

and wherein the step of using the at least one more spreading binary sequence to modulate an amplitude spectrum of the audio content to embed the at least one audio payload into the audio content includes at least using the highest-level spreading binary sequence to modulate the amplitude spectrum of the audio content in the frequency domain to embed the at least one audio payload into the audio content.
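The level-by-level spreading listed above might look as follows in code. The sketch assumes the -1/+1 notation and the rule that an element of the previous level is multiplied into the next spreading sequence; the helper names and the choice of Barker sequences of lengths 3 and 5 are illustrative assumptions.

```python
from math import prod


def spread_once(values, seq):
    """Spread a -1/+1 sequence: every element e is replaced by e * seq."""
    return [e * c for e in values for c in seq]


def spread_multilevel(payload, sequences):
    """Apply the spreading sequences one level after another."""
    current = list(payload)
    for seq in sequences:
        current = spread_once(current, seq)
    return current


payload = [1, -1, 1, 1]            # N = 4 payload bits
levels = [
    [1, 1, -1],                    # first level (Barker sequence of length 3)
    [1, 1, 1, -1, 1],              # second level (Barker sequence of length 5)
]
highest = spread_multilevel(payload, levels)

# The highest-level sequence has length N * L1 * L2 = 4 * 3 * 5 = 60.
print(len(highest) == len(payload) * prod(len(s) for s in levels))
```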

[64] In other words, although it is sufficient for the invention to use as little as one spreading sequence, the use of at least two different spreading sequences is preferred for spreading. In the prior art, one long spreading sequence is used.

[65] In some embodiments, the step of using the at least one spreading binary sequence to modulate an amplitude spectrum of the audio content to embed the at least one audio payload into the audio content includes at least:

- in case a bit of the at least one binary sequence is of a first value increasing the amplitude corresponding to the bit of the at least one more binary sequence

- in case a bit of the at least one binary sequence is of a second value reducing the amplitude corresponding to the bit of the at least one more binary sequence

[66] Preferably, a given number of the largest amplitudes of the amplitude spectrum of the audio signal is not modulated to avoid formation of beats.
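A minimal sketch of this modulation rule, including the exemption of the largest amplitudes, is given below. The gain of 1.1, the number of protected bins and the function name `modulate` are assumptions chosen only for illustration.

```python
def modulate(amplitudes, chips, gain=1.1, skip_largest=2):
    """Raise or lower each amplitude according to the matching -1/+1 chip.

    The `skip_largest` strongest bins are left untouched (assumed value).
    """
    by_size = sorted(range(len(amplitudes)),
                     key=lambda i: amplitudes[i], reverse=True)
    protected = set(by_size[:skip_largest])

    marked = []
    for i, (amp, chip) in enumerate(zip(amplitudes, chips)):
        if i in protected:
            marked.append(amp)          # largest amplitudes are not modulated
        elif chip == 1:
            marked.append(amp * gain)   # first value: increase the amplitude
        else:
            marked.append(amp / gain)   # second value: reduce the amplitude
    return marked


spectrum = [0.2, 5.0, 0.4, 0.1, 3.0, 0.3]
chips = [1, 1, -1, -1, 1, -1]
marked = modulate(spectrum, chips)
```

In this example the bins holding 5.0 and 3.0 stay unchanged, while the remaining bins are scaled up or down.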

[67] One aspect of the present invention is the use (preferably at each stage) for expanding the original bit sequence of only two sequences wherein, with respect to at least two different levels of spreading, preferably with respect to every level, the step of applying at least one spreading sequence of that level to the binary sequence of the previous level to obtain at least one spreading binary sequence of a higher level, is done such that:

- if a bit of the binary sequence of the previous level is of the first value, the bit is spread by the spreading sequence, and

- if a bit of the binary sequence of the previous level is of the second value, the bit is spread by the negative of the spreading sequence

[68] For example, the first spreading sequence corresponds to a bit with value "1", the second spreading sequence to the opposite value "0". The spreading sequences are obtained from one another by changing the values of the bits to the opposite. Thus, in the detection it is necessary to use (at each level) only (without loss of generality) the first or the second spreading sequence. The value of the audio payload bit (at each step) will be hidden in the correlation sign. The modulus (absolute value) of the correlation to some extent determines the probability of a type 1 error in a bit at the current level.
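Decoding from the sign of the correlation can be illustrated with a short sketch. It assumes the -1/+1 notation, a single level of spreading and the length-7 Barker sequence; the function names are hypothetical. One chip is flipped on purpose to show that the sign survives while the absolute value of the correlation drops.

```python
BARKER_7 = [1, 1, 1, -1, -1, 1, -1]


def spread_bit(bit, seq):
    """Bit 1 -> the sequence itself, bit 0 -> the negative of the sequence."""
    return list(seq) if bit == 1 else [-c for c in seq]


def decode(received, seq):
    """Correlate each chunk with ONE sequence; the sign gives the bit."""
    n = len(seq)
    decoded = []
    for start in range(0, len(received), n):
        corr = sum(r * c for r, c in zip(received[start:start + n], seq))
        decoded.append((1 if corr > 0 else 0, abs(corr)))
    return decoded


bits = [1, 0, 1]
sent = [chip for b in bits for chip in spread_bit(b, BARKER_7)]

received = list(sent)
received[8] = -received[8]          # disturb one chip of the second bit

result = decode(received, BARKER_7)
print(result)  # [(1, 7), (0, 5), (1, 7)]
```

The second entry is still decoded correctly, but with a smaller correlation magnitude, which is the quantity a detector could use as a confidence measure.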

[69] Combining this technique with the use of several spreading sequences at different levels instead of a single spreading sequence (so that there is only a single level), in some embodiments, it is provided that the step of applying at least one more spreading sequence different from the first spreading sequence to the first level spreading binary sequence to obtain at least one more spreading binary sequence of a higher level includes at least:

- providing a number of different spreading sequences

- applying a spreading sequence chosen from the number of spreading sequences to obtain a second level spreading binary sequence

- choosing a further spreading sequence from the number of spreading sequences and applying the further spreading sequence to obtain a third level spreading binary sequence

- repeating the previous step for a number of times, the number of times being equal to or larger than zero, until a highest-level spreading binary sequence is obtained

and wherein the step of using the at least one more spreading binary sequence to modulate an amplitude spectrum of the audio content to embed the at least one audio payload into the audio content includes at least using the highest-level spreading binary sequence to modulate the amplitude spectrum of the audio content in the frequency domain to embed the at least one audio payload into the audio content.

[70] The audio content into which the audio payload is to be embedded can be broken down into windows which do or do not overlap, and the method can be applied to each window. A window is a fragment of audio content whose length matches the size of a window function.

[71] In other words, in some embodiments, it is provided that the amplitude spectrum is fragmented into audio signal windows by applying a windowing transform into the frequency domain to the amplitude spectrum, wherein it is preferably provided that audio signal windows containing transients are encoded with less payload strength or are skipped.

[72] In some embodiments overlapping windows are used. The step of overlapping windows not only avoids artifacts that occur at the junctions of windows, but also changes the spectrum of audio content continuously throughout its length. This makes blind detection of audio payloads without prior synchronization easier.

[73] It can be provided that several consecutive windows are combined into blocks, such that windows in one block contain the same binary sequence wherein, preferably, each window in a block is assigned a polarity and spectra of windows of one block are added in accordance with their polarity value.

[74] After modulating the spectrum of a current window and returning to the time domain, the formation of various audible artifacts is possible. The greatest danger is represented by windows in which there are transient processes. They contribute to the formation of pre-echoes in the watermarked audio signal. Therefore, the use of a transient detector is preferred as described below.

[75] A preliminary synchronization can be done when the detector window starts its movement from the same position from which the window was taken to introduce the watermark in the encoder. That is, if discrete samples are numbered as "1 2 3 4 5 6 7 8 9 ..." and a window size of four samples is used, then the encoder produces the following breakdown (with overlap): (1234) (3456) (5678) ... If the detector window accepts (2345), this is recognition without prior synchronization. If the decoder does run a synchronization algorithm first and selects (1234), this is pre-synchronized. Synchronization is an additional cost of computing resources. Therefore, it is preferably avoided.

[76] If the detector moves along the track of the audio content with a smaller step than the window size, the detector receives windows: (1234) (2345) (3456) ... These windows are aggregated, and their analysis gives a mixture of good and bad windows. But bad windows do not give wrong watermarks, they just give less correlation, and their joint analysis smooths out errors.
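The window breakdowns used in this example can be reproduced with a few lines of code. The helper `windows` is an illustrative assumption; it simply slides a window of four samples with a step of two (encoder, 50 % overlap) or one (detector).

```python
def windows(samples, size, step):
    """Return all full windows of `size` samples taken every `step` samples."""
    return [tuple(samples[i:i + size])
            for i in range(0, len(samples) - size + 1, step)]


samples = list(range(1, 10))                 # samples numbered 1 .. 9

encoder_windows = windows(samples, 4, 2)     # (1234) (3456) (5678)
detector_windows = windows(samples, 4, 1)    # (1234) (2345) (3456) ...

# Detector windows that coincide with an encoder window ("good" windows).
aligned = [w for w in detector_windows if w in encoder_windows]
print(encoder_windows)
print(len(detector_windows), len(aligned))
```

With a detector step of one sample, every second detector window coincides with an encoder window in this toy case; the others are the shifted windows that yield a lower correlation.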

[77] It is preferred to use a strong overlap (e.g., 50 %), so the spectrum of the entire track changes evenly and there is no difference from which position to start recognition.

[78] In some embodiments the amplitude spectrum of the audio content is the result of a short-term Fourier transform, preferably with about 50 % overlap.

[79] In these embodiments it is preferred to choose a length $L_{FFT}$ of the amplitude spectrum such that $0.5 \cdot L_{FFT} + 1 > N \cdot L_1 \cdot L_2 \cdot \ldots \cdot L_m$, $N$ being the size of the binary sequence and $L_1, L_2, \ldots, L_m$ being the sizes of the applied spreading sequences, wherein it is preferably provided that in case the size $N \cdot L_1 \cdot \ldots \cdot L_m$ of the spreading binary sequence is less than $0.5 \cdot L_{FFT} + 1$, the frequency bins not covered by it will be ignored during watermarking and during detection.
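As a worked illustration of this length condition, the following sketch searches for the smallest power-of-two FFT length whose half spectrum (0.5 · L_FFT + 1 bins) exceeds the length N · L_1 · ... · L_m of the spread payload. The strict inequality follows the form printed in paragraph [122]; the function names and the example numbers (16 payload bits, sequences of lengths 13 and 11) are assumptions.

```python
from math import prod


def spread_length(n_bits, seq_lengths):
    """Length N * L1 * ... * Lm of the highest-level spreading sequence."""
    return n_bits * prod(seq_lengths)


def min_fft_length(n_bits, seq_lengths):
    """Smallest power of two L_FFT with 0.5 * L_FFT + 1 > N * L1 * ... * Lm."""
    needed = spread_length(n_bits, seq_lengths)
    l_fft = 2
    while 0.5 * l_fft + 1 <= needed:
        l_fft *= 2
    return l_fft


n_bits, seq_lengths = 16, [13, 11]
needed = spread_length(n_bits, seq_lengths)      # 16 * 13 * 11 = 2288
l_fft = min_fft_length(n_bits, seq_lengths)      # 8192
unused_bins = l_fft // 2 + 1 - needed            # bins left unmodulated
print(needed, l_fft, unused_bins)
```

Here an FFT length of 4096 offers only 2049 bins, which is too few, so 8192 is selected and 1809 bins remain outside the spread payload.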

[80] In some embodiments the step of using the at least one more spreading binary sequence to modulate an amplitude spectrum of the audio content to embed the at least one audio payload into the audio content includes at least:

- in case an element of the at least one more binary sequence is of a first type increasing the amplitude corresponding to the element of the at least one more binary sequence

- in case an element of the at least one more binary sequence is of a second type reducing the amplitude corresponding to the element of the at least one more binary sequence

[81] In these embodiments it is preferred not to modulate a given number of the largest amplitudes of the amplitude spectrum of the audio signal to avoid formation of beats.

[82] In some embodiments it is preferred to combine several consecutive windows into blocks such that windows in one block contain the same binary sequence wherein, preferably, each window in a block is assigned a polarity and spectra of windows of one block are added in accordance with their polarity value. The polarity of a window is the value of the coefficient "+1" or "-1", by which the spreading sequence is multiplied before modulation. Windows of one block are transformed into one window, in which the strength of the payload increases in proportion to the number of windows in the block. In other words, combining windows into one block is a duplication of the payload over time. Due to this, decoding can be performed successfully, even if several windows in the block were skipped by the decoder or were not initially encoded due to the triggering of a transient detector.

Embodiments of method for detection of transients:

[83] A "transient" is understood to be either a sudden change in the volume of an audio signal, such as a castanet or drum solo, or a sudden change in the dominant frequency of a signal, such as a guitar playing.

[84] As described above, an audio signal can be broken into windows. To detect a window of an audio signal containing transients, a method for detecting transients (transient detector) can be used. The transient detector checks each window of the incoming audio stream before feeding it to the encoder.

[85] To detect such windows, a transient detector can be used, which analyzes the spectrum of each window. As a result of the analysis, the detector informs the encoder about the presence of transients in the windows.

[86] The transient detector may have different implementations:

[87] In the simplest case, the incoming audio signal window is split into non-intersecting (sometimes overlapping) fragments. Then the energy of each fragment is calculated. Next, the change in energy from fragment to fragment is analyzed. Abrupt changes in the resulting energy curve are characteristic of a transient process. More complex implementations come down to constructing the envelope of an audio signal and its further analysis.
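The simplest, energy-based variant might be sketched as follows. The fragment length, the ratio threshold of 8 and the function names are assumptions; in addition, this sketch only reacts to sudden increases in energy (attacks), which is a simplification of the general notion of a transient.

```python
def fragment_energies(window, fragment_len):
    """Energy of each non-intersecting fragment of the window."""
    return [sum(x * x for x in window[i:i + fragment_len])
            for i in range(0, len(window) - fragment_len + 1, fragment_len)]


def has_transient(window, fragment_len=4, ratio=8.0, floor=1e-9):
    """Flag the window if the energy jumps by more than `ratio` between fragments."""
    energies = fragment_energies(window, fragment_len)
    return any(cur > ratio * (prev + floor)
               for prev, cur in zip(energies, energies[1:]))


steady = [0.1, -0.1] * 8                          # constant level
attack = [0.01, -0.01] * 4 + [0.9, -0.9] * 4      # quiet, then a sudden onset

print(has_transient(steady), has_transient(attack))  # False True
```

An encoder could use such a flag to skip the window or to embed with reduced strength.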

[88] The preferred implementation of the transient detector is to split the original window into fragments and attempt to predict each subsequent fragment spectrum based on the previous spectrum. Sharp and unpredictable changes in the phase and amplitude spectrum indicate the presence of transient processes in the window.

Embodiments of first and second method for decoding audio payload:

[89] Generally speaking, there are two preferred embodiments for decoding:

- combining windows into one window (the polarity of windows is preferably chosen in accordance with the Barker sequences)

- increasing frequency resolution by selecting a window in the detector with a multiple of the dimensions of the encoding window

[90] In some embodiments of the decoding method:

- there are provided at least three different spreading sequences that have been used to modulate the amplitude spectrum of the audio content to embed the at least one audio payload into the audio content

- for each of the provided spreading sequences the correlation with the combined amplitude spectra is calculated

- the calculated correlations are combined into one new sequence

- iterating the process m times to obtain a sequence of length N

[91] In these embodiments it can be provided that, before the step of calculating correlations, window polarities are determined by all possible permutations in the spreading sequences that were used for encoding, wherein it is preferably provided that the amplitude spectrum of the windows is cropped to an expanding sequence of length $N \cdot L_1 \cdot L_2 \cdot \ldots \cdot L_m$.

[92] Applying audio payload can degrade the quality of audio content. In particular, when encoding audio signals with a dominant frequency in the mid or high range of audible frequencies (where the watermark hides) such as ringing a bell, triangle, or solo on stringed instruments, audible reverberations can occur (or intensify). They are due to beats that result from the addition of two harmonics with similar frequencies and amplitudes.

[93] For robust detection of payloads in an audio signal, several consecutive windows are combined into blocks. Windows in one block contain the same payload. Each window in the block is assigned a polarity. The polarity of a window is the value of the coefficient "+1" or "-1", by which the spreading sequence is multiplied before modulation. Stability is achieved by adding the spectra of windows of one block in accordance with their polarity value. Thus, windows of one block are transformed into one window, in which the strength of the audio payload increases in proportion to the number of windows in the block. In other words, combining windows into one block is a duplication of the audio payload over time. Due to this, decoding can be performed successfully, even if several windows in the block were skipped by the decoder or were not initially encoded due to the triggering of the transient detector.
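The effect of adding window spectra according to their polarity can be illustrated with a toy example. It is deliberately idealised: the host spectrum is assumed to be identical in both windows of the block (so that it cancels exactly), the spectra are treated as log-amplitudes to which the payload is added, and the polarity pattern {+1, -1}, the strength of 0.5 and all names are assumptions.

```python
def mark(host, seq, polarity, strength):
    """Embed polarity * strength * seq into a (log-)amplitude spectrum."""
    return [h + polarity * strength * c for h, c in zip(host, seq)]


def combine(spectra, polarities):
    """Add the window spectra of one block according to their polarity."""
    return [sum(p * s[k] for p, s in zip(polarities, spectra))
            for k in range(len(spectra[0]))]


def correlate(values, seq):
    return sum(v * c for v, c in zip(values, seq))


seq = [1, -1, 1, 1, -1, -1, 1, -1]
host = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]   # same host in both windows
polarities = [1, -1]
strength = 0.5

block = [mark(host, seq, p, strength) for p in polarities]

single = correlate(block[0], seq)                       # host dominates
combined = correlate(combine(block, polarities), seq)   # host cancels

# Second window left unmarked, e.g. skipped because of a transient.
partial = correlate(combine([block[0], host], polarities), seq)
print(single, combined, partial)  # -7.0 8.0 4.0
```

In this toy case a single window gives a correlation of the wrong sign, the combined block gives the full payload contribution, and a block with one unmarked window still yields a positive, if smaller, correlation. With real audio the host spectra of neighbouring windows differ, so the cancellation is only partial.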

[94] During the decoding process, the detector window reads a piece of audio content, the length of which is equal to the length of the block. The step of shifting the detector window is set to some value $\delta \cdot L_{FFT}$, where $L_{FFT}$ denotes the FFT length and $\delta$ denotes a fraction of that length. Thus, it is guaranteed that, sooner or later, the entire block will appear in the detector window with an accuracy of $\delta \cdot L_{FFT}$. It has been found that this is sufficient to successfully decode the audio payload. For greater reliability of detection, blocks are also sequentially duplicated. It was found that in the case when the detector window contains windows from blocks with different audio payloads, the decoding result depends on the ratio of windows. Thus, the lower bound for synchronization accuracy can be chosen to be half the block length.

[95] It has been found that decoding is more likely to be successful if the decoder window boundaries coincide with the window boundaries. However, absolute timing accuracy is not required when using this method. Synchronization accurate to $\delta \cdot L_{FFT}$ is sufficient for successful payload recognition.

[96] Upon detection, the polarity of each window from the block is unknown. It is also unknown whether the analyzed windows belong to the same block. Therefore, a complete enumeration of all possible linear combinations of windows is carried out. In order to select from the set of all linear combinations the one with which the encoding was performed, the polarity of the windows in the block is determined in accordance with the Barker sequence. In this case, false combinations will have a minimum level in relation to the desired one.

[97] It has been found that audio payloads are more stable when the spectra are subtracted, rather than when they are added. Therefore, out of two Barker sequences of the same length, to determine the polarity of windows in a block, it is better to choose the one that contains more negative values than positive ones. It is to be mentioned that initially in the encoder the polarities of the blocks can be selected in accordance with the Barker sequence (Bl). The consequence of this is a complete enumeration of possible options in the decoder in accordance with the choice of Bl in the encoder. If the length of Bl when encoding was equal to seven, then in the decoder one has to go through seven options.

[98] The detector reads audio data at intervals several times shorter than the window length. The smaller the interval, the more detailed the track is examined - the more likely it is to detect audio payload, but the greater the computational load.

[99] The decision whether the detected audio payload is true can be made up of the following two components: The first is the correlation value that was achieved when the audio payload was detected. The second is the number of repetitions of the audio payload, for a given period of time - the decay time. The audio payload can be stored in memory for no more than the decay time and then deleted.

[100] With a small detector step, sequentially read windows are more likely to give the same audio payload, simply because they contain the same elements, and are very similar. In order to avoid an increase in weight by the second factor - the number of repetitions, it is necessary to count the number of repetitions for audio payloads found not one after another, but at a certain safe distance.

[101] The standard scheme is when the polarity of the windows is as follows:

{+-} {+-}... That is, the sequence 10 or 01 is used to determine the polarity. It is the idea of this embodiment that, in fact, longer sequences produce a better effect. Moreover, it is best to choose Barker sequences for this purpose and where there are more zeros, that is, subtractions. For example, from {++} and {-+}, the second one is preferable.

[102] Incorrectly decoded audio payloads can be detected if the correlation values are below a certain threshold. The correlation threshold is chosen from a compromise between decoding accuracy and recognition time. To increase the reliability of decoding, links with short spreading sequences can be introduced between individual bits of the audio payload. In this way, the integrity of the audio payload is improved. If a part of the audio payload is decoded incorrectly, then subsequent bits will be decoded by applying incorrect spreading sequences. Thus, the audio payload will not be decoded at the cost of decreasing the correlation value.

[103] To ensure reliability, redundant coding can be applied to the audio payload in binary form. A checksum is added, e.g., CRC (cyclic redundancy check). After decoding, the checksum is verified. Audio payloads that do not pass the test are rejected.
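A checksum test of this kind could be sketched as below. The specification only names CRC as an example; the use of the 32-bit CRC from Python's standard `zlib` module, the 4-byte big-endian trailer and the function names are assumptions made for the illustration.

```python
import zlib


def add_checksum(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload as a 4-byte big-endian trailer."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify(frame: bytes):
    """Return the payload if the checksum matches, otherwise None."""
    payload, received_crc = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == received_crc:
        return payload
    return None


frame = add_checksum(b"\x12\x34\x56")

# Simulate a decoding error: one bit of the payload is flipped.
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]

print(verify(frame), verify(corrupted))
```

A decoded payload for which `verify` returns `None` would be rejected. In a real system a shorter CRC may be preferable, since every checksum bit also has to be spread and embedded.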

[104] Simple numerical example 6:

[105] Suppose that during encoding windows with a length of 4096 were used. During decoding, windows with a length of 8192 are used; however, not every frequency bin will be used, but only every second frequency bin. This improves recognition of an embedded audio payload because information from two windows is accumulated into one window.

[106] Even better results can be achieved if during decoding windows of a length equal to three or four or x times the length of the encoding windows are used, and only every third, fourth or x-th frequency bin is used.
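The relation between encoder bins and decoder bins can be checked on a small scale. The sketch below uses window lengths of 8 and 16 instead of 4096 and 8192, a plain discrete Fourier transform written out by hand, non-overlapping windows and a stationary test tone; all of these are simplifying assumptions for illustration.

```python
import cmath
import math


def amplitude_spectrum(samples):
    """Amplitudes of bins 0 .. n/2 of a plain (slow) discrete Fourier transform."""
    n = len(samples)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples)))
            for k in range(n // 2 + 1)]


ENC_LEN, FACTOR = 8, 2
DEC_LEN = ENC_LEN * FACTOR

# A tone sitting exactly on encoder bin 3, lasting for two encoding windows.
tone = [math.cos(2 * math.pi * 3 * t / ENC_LEN) for t in range(DEC_LEN)]

enc = amplitude_spectrum(tone[:ENC_LEN])   # one encoding window
dec = amplitude_spectrum(tone)             # one decoding window, twice as long
dec_on_encoder_grid = dec[::FACTOR]        # keep only every second bin

print(len(enc), len(dec_on_encoder_grid))
print(round(enc[3], 6), round(dec_on_encoder_grid[3], 6))
```

Every second bin of the long window lies on the same frequency as a bin of the short window, and for this stationary tone the amplitude collected there is twice as large, which reflects the accumulation of information from two encoding windows.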

[107] In some embodiments the following technique can be used instead of or in addition to the technique described above:

[108] During decoding, the window size is chosen to be several times larger than the window size used during encoding. This results in a higher frequency resolution. It is clear that the frequency bins must be chosen so that the frequencies corresponding to them correspond to those that were chosen during coding. Due to the large window size, there is several times more information per frequency bin, since it is collected from several windows. Due to this, a greater correlation is achieved, and recognition is improved.

BRIEF DESCRIPTION OF DRAWINGS

[109] Figure 1 shows a block schematic diagram of a system having a transient detector and an encoder according to an embodiment of the invention.

[110] Figure 2 shows a block schematic diagram of a decoder according to an embodiment of the invention.

[111] Figure 3 shows an embodiment of a scheme for constructing different spreading sequences.

[112] Figure 4 shows the structure of an example of audio payload.

[113] In Figure 1, a system having a transient detector and an encoder according to the invention is shown. Although this is not shown, the encoder is, of course, provided with at least one electronic processing unit, electronic memory, a bus system and all other electronic components common for a computer. The transient detector, although a separate logical entity, can be provided as a configuration of the same hardware which is configured to act as the encoder, or a separate hardware can be provided.

[114] The encoder is provided with the output of a module computing a Fast Fourier Transformation (FFT) which is provided with the audio content into which the audio payload is to be embedded as described above. The encoder provides the audio content with embedded audio payload to a sender which broadcasts this signal over the airwaves. As shown, an optional psychoacoustical processing module can be provided to preserve sound quality by enabling the encoder to encode the audio payload with a strength depending on the frequency range, e.g., using an absolute hearing threshold curve as described in ISO 226 to determine allowable strength of the audio payload depending on the frequency range.

[115] In Figure 2, a decoder according to the invention is shown. Although this is not shown, the decoder is, of course, provided with at least one electronic processing unit, electronic memory, a bus system and all other electronic components common for a computer.

[116] Audio content with embedded audio payload sent over the airwaves is received by a receiver and transmitted to the decoder which extracts the audio payload from the audio content as described below. The audio content itself can be transmitted, e.g., to a loudspeaker.

[117] It should be noted that Figure 1 shows different logical components as blocks. It is of course possible that each block corresponds to a physical piece of hardware. Alternatively, some of the different blocks or even all of them could be combined into one physical piece of hardware. The same holds true with respect to Figure 2.

[118] In the following, based on Figures 3 and 4, a preferred embodiment of a method for encoding at least one audio payload, in particular a watermark, in audio content is discussed, as well as a method for decoding at least one audio payload, in particular a watermark, from audio content.

[119] The embodiment described herein uses multiple spreadings of the audio payload with short sequences, which is found to result in more robust encoding than a single spreading using one long spreading sequence as suggested in the art. That is, if the length (i.e., the one-dimensional size) of the audio payload is $N$, and the length of the first pn-sequence is $L_1$, then the length of the first spreading binary sequence will be equal to $N \cdot L_1$.

[120] In a preferred embodiment of the invention the bits with the value "1" should be spread with the direct sequence, and the bits with the value "0" with the same sequence, but with the opposite sign. The resulting sequence (with length $N \cdot L_1$) is spread with a new pn-sequence of length $L_2$. Thus, a new spreading binary sequence of length $N \cdot L_1 \cdot L_2$ is obtained. Continuing this process $m$ times, a spreading binary sequence of length $N \cdot L_1 \cdot L_2 \cdot \ldots \cdot L_m$ is obtained. For stable detection of audio payload, $m$ should be at least two or three.

[121] As a result, the amplitude spectrum of the audio signal will be directly modulated by sequences of length $L_m$, which will be repeated $N \cdot L_1 \cdot L_2 \cdot \ldots \cdot L_{m-1}$ times and differ only in signs ($L_m$ and $-1 \cdot L_m$). To increase the resistance of the payload to various bandpass filters and equalizers, as well as cryptographic strength, $L_m$ sequences are used to modulate the spectrum in a pseudo-random order. This order can be determined by a secret key for decoding the audio payload. Also, this key can be used to generate short pn-sequences.
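One possible reading of the key-dependent pseudo-random order is a permutation of the modulated positions that both encoder and decoder derive from the same secret key. The sketch below follows that reading; it is an assumption, as are the function names and the use of Python's `random.Random`, which is not a cryptographically secure generator and serves here only as a stand-in.

```python
import random


def keyed_order(n_positions, key):
    """Permutation of 0 .. n_positions-1 derived deterministically from the key."""
    order = list(range(n_positions))
    random.Random(key).shuffle(order)
    return order


def scatter(chips, order):
    """Encoder side: place chip i at position order[i]."""
    placed = [0] * len(chips)
    for chip, position in zip(chips, order):
        placed[position] = chip
    return placed


def gather(values, order):
    """Decoder side: read the positions back in the keyed order."""
    return [values[position] for position in order]


chips = [1, 1, -1, 1, -1, -1, 1, -1, 1, 1]
order = keyed_order(len(chips), key=123456)

scattered = scatter(chips, order)
recovered = gather(scattered, order)
print(recovered == chips)  # True
```

A decoder without the key would read the positions in the wrong order and would obtain low correlations.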

[122] The amplitude spectrum can be obtained, in particular, as a result of a short-term Fourier Transform with 50 % overlap. An audio signal window is a fragment of a long spectrum to which a windowing transform into the frequency domain is applied. In this case, due to symmetry, the length of the spectrum $L_{FFT}$ must satisfy the following condition: $0.5 \cdot L_{FFT} + 1 > N \cdot L_1 \cdot L_2 \cdot \ldots \cdot L_m$ (we further assume that $L_{FFT}$ is always even; moreover, for fast Fourier transform, one should take $L_{FFT} = 2^k$, $k = 1, 2, \ldots$). In the case where the length of the spreading sequence is less than $0.5 \cdot L_{FFT} + 1$, the spreading sequence is adjusted to a work range of frequency bins as long as the inequality remains true. This occurs both at the beginning and at the end of the sequence to capture frequency bins that correspond to the 600 Hz to 13 kHz midrange, which typically forms the basis of a sound or musical composition.

[123] After adjusting the spreading sequence to a length of $0.5 \cdot L_{FFT} + 1$, the amplitude spectrum is modulated.

[124] Modulation can be carried out as follows. The amplitude spectrum is preferably converted to a semi-logarithmic scale. If the element is equal to "1", then the amplitude of the spectrum increases (on a linear scale, it is multiplied) by a certain gain $a$, the strength of the audio payload; if the element is "-1", then the amplitude of the spectrum is reduced by $a$. Thus, the window is the minimum length of the audio signal fragment, which completely contains the spreading sequence, and, consequently, the audio payload.

[125] The decoding process can be similar to the encoding process.

[126] Windows can be used such that a block in the decoder window is divided into windows. For each window, its amplitude spectrum is calculated.

[127] The obtained spectra can be combined into one window. Window polarities can be determined by all possible permutations in the pn-sequence (e.g., Barker sequence) that was used for encoding.

[128] The spectrum of the windows can be cropped to an expanding sequence of length $N \cdot L_1 \cdot L_2 \cdot \ldots \cdot L_m$.

[129] Next, the correlations between the spreading sequence and the spectrum can be found. The correlation results are combined into one new sequence of length $N \cdot L_1 \cdot L_2 \cdot \ldots \cdot L_{m-1}$. After $m$ iterations, a sequence of length $N$ is obtained. Each value of the sequence is a correlation value. If a value in the received sequence is greater than zero, it is decoded as 1; if it is less than zero, it is decoded as 0.
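The iterative correlation can be sketched as a round trip with two levels of spreading. For simplicity the "spectrum" here is just the -1/+1 spread sequence with a few sign errors, rather than a real amplitude spectrum; the function names, the Barker sequences of lengths 3 and 5 and the positions of the errors are assumptions.

```python
def spread_once(values, seq):
    """Every element e is replaced by e * seq."""
    return [e * c for e in values for c in seq]


def despread_once(values, seq):
    """Correlate consecutive chunks with seq; one correlation value per chunk."""
    n = len(seq)
    return [sum(v * c for v, c in zip(values[i:i + n], seq))
            for i in range(0, len(values), n)]


def decode(values, sequences):
    """Undo the levels from the last one back to the first, then take signs."""
    current = list(values)
    for seq in reversed(sequences):
        current = despread_once(current, seq)
    return [1 if c > 0 else 0 for c in current], current


sequences = [[1, 1, -1], [1, 1, 1, -1, 1]]     # L1 = 3, L2 = 5
bits = [1, 0, 0, 1]                            # N = 4

chips = [1 if b == 1 else -1 for b in bits]
for seq in sequences:
    chips = spread_once(chips, seq)            # length 4 * 3 * 5 = 60

received = list(chips)
for i in (0, 17, 33, 59):                      # four disturbed positions
    received[i] = -received[i]

decoded, correlations = decode(received, sequences)
print(decoded, correlations)  # [1, 0, 0, 1] [13, -13, -13, 13]
```

Without disturbance each final correlation would have magnitude 15; the four errors reduce it to 13 while the signs, and therefore the decoded bits, remain correct. Note that the correlation values are passed on unquantised from one level to the next in this sketch.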